Tutorial on using and learning phrases from text
by Cassandra Jacobs
Prepared as an assignment for CS410: Text Information Systems in Spring 2016
Roadmap
•  What are phrases?
•  Why use phrases?
•  What NLP tasks do phrases help?
•  How do we mine phrases?
What are phrases?
•  Word combinations
•  Literal and idiomatic meanings
– “kick the bucket” – to die
– “strong coffee” – highly caffeinated, concentrated
– “data mining” – a particular concept in computer science
Why phrases?
•  Phrases can express ideas not obvious from the individual words
– White House (an important building)
– red herring (an anomaly)
– syntactic parsing (a paper topic)
•  Can disambiguate words “for free”
– (river) bank versus (financial) bank
Phrases versus words
•  Difficult to extract from text
•  n words, but n² possible bigrams, n³ possible trigrams, etc.
– Phrases are always rarer than individual words
– Simple measures like frequency can lead to bad phrases (e.g. “in the”, “is a”, “not our”)
Phrases versus words
•  Some probabilistic measurements are good proxies for “phraseness”
•  Mutual information identifies phrases that occur together more often than chance:

p(a,b) / (p(a) p(b))
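This ratio can be estimated directly from unigram and bigram counts. A minimal sketch (the toy corpus is illustrative only):

```python
from collections import Counter

def mi_ratio(tokens, a, b):
    """Estimate p(a,b) / (p(a) p(b)) from unigram and adjacent-bigram counts."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_ab = bigrams[(a, b)] / (n - 1)
    p_a = unigrams[a] / n
    p_b = unigrams[b] / n
    return p_ab / (p_a * p_b)

tokens = "data mining is fun and data mining is useful".split()
# "data" and "mining" always co-occur here, so the ratio is well above 1
print(mi_ratio(tokens, "data", "mining"))  # → 5.0625
```

A ratio near 1 means the pair co-occurs about as often as chance; phrase-mining methods keep pairs whose ratio is far above 1.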
Phrases versus words
•  Unsupervised methods like topic models of bigrams often provide strange results
– “I mean”
– “Well I”
•  Distributional similarity/vector methods require supervision or feedback about phrase quality
Phrases versus words
•  Low numbers of observations
– Huge domain differences in whether phrases are used
•  E.g. ACL submissions are encouraged not to use idiomatic expressions
– Formal versus informal contexts
– Differences between writers’ language backgrounds
Tasks where phrases are useful
•  Good phrases should improve or reflect
–  Document classification tasks
–  External knowledge (Wikipedia titles, dictionary)
–  Analogy solving
–  Paraphrase identification
–  Similarity ratings on Amazon Mechanical Turk
–  Machine translation
Task 1: Named entity recognition
•  Some studies use wiki phrases (Wikipedia article titles), taking all the titles and using them in other tasks
•  Can parse a sentence for entities by automatically labeling the entities that appear in Wikipedia
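Labeling entities by lookup against a title list can be sketched as a greedy longest-match scan (the toy title set and function name are invented for illustration, not the method of any particular paper):

```python
def tag_wiki_phrases(tokens, titles):
    """Greedy longest-match tagging of known multi-word titles in a token list."""
    tagged, i = [], 0
    while i < len(tokens):
        match = None
        for j in range(len(tokens), i, -1):  # try the longest span first
            if " ".join(tokens[i:j]) in titles:
                match = j
                break
        if match:
            tagged.append("_".join(tokens[i:match]))  # merge into one unit
            i = match
        else:
            tagged.append(tokens[i])
            i += 1
    return tagged

titles = {"Hillary Clinton", "Donald Trump"}
print(tag_wiki_phrases("Polls show Hillary Clinton and Donald Trump".split(), titles))
# → ['Polls', 'show', 'Hillary_Clinton', 'and', 'Donald_Trump']
```

Preferring the longest span ensures that a title containing another title is matched as a whole, not split into its parts.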
Identifying wiki phrases for named entity recognition
•  Polls show Democrat[ORG] Hillary_Clinton[PER] and Republican[ORG] Donald_Trump[PER] ahead by double-digit margins
•  Wiki phrases like Hillary_Clinton and Donald_Trump contain lots of clues that they are people
Identifying wiki phrases for named entity recognition
•  Passos, Kumar, & McCallum (2014)
– Keep bigrams where p(a,b)/(p(a)p(b)) > 1000
– Then take the top 1M phrases
– Create embeddings from these phrases
– Embeddings used as features in named entity recognition (NER)
– Using phrase embeddings led to state-of-the-art NER
Task 2: Using idioms in sentiment analysis
•  Bag-of-individual-words models would probably misclassify these two
– “not that bad” → ok
– “not that good” → probably bad
•  Sometimes adding in phrase information increases noise and runtime
Using idioms in sentiment analysis
•  Williams et al. (2015) annotated idioms in context as either positive or negative
– 580 idioms from a language learner textbook
– Regular expressions to identify variants
– “Not that bad” → neutral
– “A drop in the bucket” → good
•  Sentiment classification accuracy increased from 45% to 60% with the addition of idioms
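Matching idiom variants with regular expressions can be sketched like this (the single pattern and its variant set are invented for illustration; the paper's lexicon covers 580 idioms):

```python
import re

# one toy idiom pattern; "(?:bucket|ocean)" stands in for a hand-written variant set
DROP_PATTERN = re.compile(r"\ba drop in the (?:bucket|ocean)\b", re.IGNORECASE)

def contains_idiom(text):
    """True if the text contains any variant of the idiom."""
    return DROP_PATTERN.search(text) is not None

print(contains_idiom("This donation is a drop in the ocean."))  # True
print(contains_idiom("Please fill the bucket."))                # False
```

Once an idiom is detected, its annotated polarity can override the (often misleading) word-level sentiment of its parts.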
Task 3: Using idioms in phrase analogies
Toronto : Toronto Maple Leafs :: Montreal : Montreal Canadiens
– Want to produce complex, non-word output in an analogy task
Using idioms in phrase analogies
•  Mikolov et al. (2013)
•  In an analogy task, need to first identify phrases
– High mutual information score cutoff for phrase learning
– Train a neural network model to learn distributed phrase vector representations
Using idioms in phrase analogies
•  Neural network representations treat frequently co-occurring word pairs as single units
– “Toronto Maple Leafs” is treated like a single word by the model
– Model predicts the contexts given words and phrases as input
– “Toronto Maple Leafs” and “Montreal Canadiens” both predict a “hockey” context when the individual words do not
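With phrase vectors in hand, the analogy itself reduces to vector arithmetic plus a nearest-neighbor search. A sketch with hand-made 2-d vectors (the vectors are toy stand-ins for learned embeddings, not real model output):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vecs):
    """Return the vocabulary item closest to vec(b) - vec(a) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vecs[b], vecs[a], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))

# toy 2-d vectors: dimension 0 ≈ "city identity", dimension 1 ≈ "is a hockey team"
vecs = {
    "Toronto": [1.0, 0.1],
    "Toronto_Maple_Leafs": [1.0, 1.0],
    "Montreal": [0.9, 0.1],
    "Montreal_Canadiens": [0.9, 1.0],
}
print(analogy("Toronto", "Toronto_Maple_Leafs", "Montreal", vecs))
# → Montreal_Canadiens
```

Because phrases were merged into single vocabulary items during training, multi-word answers like "Montreal_Canadiens" can be returned directly.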
How to learn phrases?
•  Unsupervised methods
•  Supervised methods
Unsupervised learning of phrases
•  Some papers focus on how to get good phrases beyond mutual information measures
– Shallow parsing with structural constraints (no “of the United”)
– If a phrase includes another phrase, the whole phrase must be included (“President of the United States”)
Unsupervised learning of phrases
•  Cho et al. (2014) propose a recurrent neural network model for machine translation that predicts words and phrases in a target language
– Input: Word and next word in source language
– Output: Word and next word in target language
Unsupervised learning of phrases
•  Predicting the next word in a foreign language helps the model associate the past with potential future output
– Phrases learned in the Cho et al. (2014) model cluster “one to three months” near “for two months”
Supervised learning of phrases
•  Liu et al. (2015) score phrase quality using two properties
– Informativeness within a document (effectively term frequency/inverse document frequency)
– Concordance (conventionality, judged by the difference between competing combinations – e.g. “powerful coffee” versus “strong coffee”)
– Like TF-IDF for phrases
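The informativeness criterion is essentially TF-IDF applied to candidate phrases. A toy version (this scoring is a simplification for illustration, not Liu et al.'s actual feature set):

```python
import math

def informativeness(phrase, doc, corpus):
    """TF-IDF-style score: frequent in this document, rare across the corpus."""
    tf = doc.count(phrase)                 # occurrences in the document of interest
    df = sum(phrase in d for d in corpus)  # number of documents containing the phrase
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [
    "strong coffee is strong coffee",
    "the weather is mild",
    "strong coffee keeps me up",
]
# "strong coffee": tf = 2 in the first document, df = 2 of 3 documents
print(informativeness("strong coffee", corpus[0], corpus))
```

A phrase that appears often in one document but rarely elsewhere scores high; phrases like "in the" that appear everywhere get an IDF near zero.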
Evaluation of learned phrases
•  Perplexity of the data given the model
– Higher perplexity means the model explains the data less well
– When a model captures more dependencies in the data, the phrases it includes are good (El-Kishky et al., 2015)
– This metric works better for some domains than others (e.g. Yelp)
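Perplexity can be computed from the per-token probabilities a model assigns to held-out text; a minimal sketch (the probability values are made up for illustration):

```python
import math

def perplexity(token_probs):
    """Exp of the average negative log-probability; lower means a better fit."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# a model assigning each token probability 0.5 beats one assigning 0.25
print(perplexity([0.5, 0.5, 0.5]))     # 2.0
print(perplexity([0.25, 0.25, 0.25]))  # 4.0
```

If treating "data mining" as one unit lets the model assign higher probability to the corpus, perplexity drops, which is the signal used to judge the phrases.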
Evaluations of phrases
•  El-Kishky et al. (2015) also compared retrieved phrases against Wikipedia titles
– If a phrase is in Wikipedia, it is a very good phrase
– If not, it is harder to evaluate
– Works for some domains but maybe not others (e.g. abstracts and papers)
Current state of research
•  No gold standard for evaluating whether a phrase is good or not
– Many available datasets and applications
– Less clear how to learn phrases in an unsupervised framework
– Many models implicitly or explicitly use mutual information and background language models as filters
References
El-Kishky, A., Song, Y., Wang, C., Voss, C. R., & Han, J. (2015). Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment, Vol. 8 (also Proc. 2015 Int. Conf. on Very Large Data Bases (VLDB'15), Kohala Coast, Hawaii, Sept. 2015).
Liu, J., Shang, J., Wang, C., Ren, X., & Han, J. (2015, May). Mining quality phrases from
massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on
Management of Data (pp. 1729-1744). ACM.
Passos, A., Kumar, V., & McCallum, A. (2014). Lexicon infused phrase embeddings for named
entity resolution. arXiv preprint arXiv:1404.5367.
Williams, L., Bannister, C., Arribas-Ayllon, M., Preece, A., & Spasić, I. (2015). The role of idioms
in sentiment analysis. Expert Systems with Applications, 42, 7375-7385.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed
representations of words and phrases and their compositionality. In Advances in neural
information processing systems (pp. 3111-3119).
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., &
Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical
machine translation. arXiv preprint arXiv:1406.1078.
