Visualizing Words and Topics with
Scattertext
Jason S. Kessler*
June 14, 2018
Code for all visualizations is available at:
https://github.com/JasonKessler/PyDataSeattle2018
$ pip3 install scattertext
@jasonkessler
*No, not that Jason Kessler
Lexicon speculation
Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification
using machine learning techniques. EMNLP. 2002. (ACL 2018 Test of Time Award Winner)
Lexicon mining ≈ lexicon speculation
One word cloud is made from positive reviews, the other from
negative reviews (Mueller’s wordcloud)
Motivation
• What language can be used to better market a product?
• Words characteristic of effective (vs ineffective) marketing messages
• Uncover framing of political issues
• How do Republican and Democratic politicians talk differently about
abortion?
• Cultural anthropology
• What topics or language are characteristic of groups of people?
• Psycholinguistics
• How language use is associated with personality and other personal characteristics
• Writing better headlines
Language and Demographics
Christian Rudder: http://blog.okcupid.com/index.php/page/7/
Example phrases from the chart: hobos, almond butter, 100 Years of Solitude, Bikram yoga
Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010)
OKCupid: Words and phrases that distinguish white
men.
Explanation
OKCupid: Words and phrases that
distinguish Latin men.
Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010)
Ranking with everyone else
The smaller the distance from the top left, the
higher the association with white men
Source: Christian Rudder. Dataclysm. 2014.
Phish is highly associated with white men
Kpop is not
my blue eyes
Source: Christian Rudder. Dataclysm. 2014.
Scattertext
pip install scattertext
github.com/JasonKessler/scattertext
Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017.
- Interactive, d3-based scatterplot
- Concise Python API
- Automatically displays non-overlapping labels
Scaled F-Score
• Term-class* associations:
• “Good” is associated with the “positive” class
• “Bad” is associated with the “negative” class
• Core intuition: association relies on two necessary factors
• Frequency: How often a term occurs in a class
• Precision: P(class|document contains term)
• F-Score:
• Information retrieval evaluation metric
• Harmonic mean of precision and recall
• Requires both metrics to be high
• *Term: a word, phrase, or other discrete linguistic element
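As a rough illustration of the intuition above, the sketch below computes precision from raw counts and takes the harmonic mean of precision and frequency. The function names and the example numbers are illustrative assumptions, not part of Scattertext's API.

```python
def precision(pos_count, neg_count):
    """P(class | term): share of a term's occurrences that fall in the class."""
    total = pos_count + neg_count
    return pos_count / total if total else 0.0


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; 0 if both are 0."""
    return 2 * a * b / (a + b) if (a + b) else 0.0


# Illustrative numbers: a term seen 108 times in positive reviews and
# 36 times in negative reviews, making up 0.22% of positive-class terms.
prec = precision(108, 36)   # 0.75
freq = 0.0022               # P(term | class), assumed for this example
f_score = harmonic_mean(prec, freq)
print(f"precision={prec:.2f} frequency={freq:.4f} harmonic mean={f_score:.4f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, both precision and frequency need to be high for the score to be high.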
Naïve approach
Y-axis: Precision, i.e. P(class|term); roughly normal distribution (mean ≅ 0.5, sd ≅ 0.4)
X-axis: Frequency, i.e. P(term|class); roughly power distribution (mean ≅ 0.00008, sd ≅ 0.008)
Color: Harmonic mean of precision and frequency (blue = high, red = low)
Problem:
• Top words are just stop words.
• Why?
• Harmonic mean of uneven distributions.
• Most words have a precision of ~0.5, which leads the harmonic
mean to depend almost entirely on frequency.
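A small sketch of why this happens, using made-up precision and frequency values (the two terms and their numbers are hypothetical): when precision is orders of magnitude larger than frequency, the harmonic mean is roughly twice the frequency, so frequent stop words win.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; 0 if both are 0."""
    return 2 * a * b / (a + b) if (a + b) else 0.0


# Hypothetical (precision, frequency) pairs, chosen only for illustration.
terms = {
    "the": (0.50, 0.0500),           # stop word: uninformative but very frequent
    "entertaining": (0.82, 0.0012),  # informative but comparatively rare
}

raw_scores = {term: harmonic_mean(p, f) for term, (p, f) in terms.items()}
ranked = sorted(raw_scores, key=raw_scores.get, reverse=True)
print(ranked)  # the stop word ranks first despite its 0.5 precision
```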
Fix: Normalize Precision and Frequency
• Task: make precision and frequency similarly distributed
• How: take the normal CDF of each term’s precision and frequency
• Mean and std. computed from data
• Right: log-normal CDF
This area is the log-normal CDF of the term “beauty” (0.938 ∈ [0,1]).
Each tick mark is the log-frequency of a term.
*The log-normal CDF isn’t used in these charts
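The normalization step can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not Scattertext's implementation: the function names are hypothetical, the mean and standard deviation are fitted to the values passed in, and the three example terms are invented.

```python
from statistics import NormalDist, mean, pstdev


def normal_cdf_scale(values):
    """Map each value to the CDF of a normal distribution fitted to the values."""
    dist = NormalDist(mu=mean(values), sigma=pstdev(values))  # requires sigma > 0
    return [dist.cdf(v) for v in values]


def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; 0 if both are 0."""
    return 2 * a * b / (a + b) if (a + b) else 0.0


def scaled_f_scores(precisions, frequencies):
    """Harmonic mean of CDF-normalized precision and frequency, per term."""
    norm_prec = normal_cdf_scale(precisions)
    norm_freq = normal_cdf_scale(frequencies)
    return [harmonic_mean(p, f) for p, f in zip(norm_prec, norm_freq)]


# Hypothetical values for three terms.
precisions = [0.2, 0.5, 0.8]
frequencies = [0.001, 0.030, 0.002]
scores = scaled_f_scores(precisions, frequencies)
print([round(s, 3) for s in scores])
```

After scaling, both inputs lie in [0, 1] and are spread over comparable ranges, so neither one dominates the harmonic mean by default.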
Scaled-F-Score
Positive Scaled-F-Score
Y-axis: NormCDF-Precision
X-axis: NormCDF-Frequency
Good: positive terms make sense!
Still some function words, but that’s okay.
Note: These frequent terms are all very close to 1 on the x-axis, but are ordered.
Top Scaled-F Score Terms
Term          Pos Freq.  Neg Freq.  Prec.   Freq %  Raw Hmean  Prec. CDF  Freq. CDF  Scaled F-Score
best          108        36         75.00%  0.22%   0.44%      71.95%     99.50%     83.51%
entertaining  58         13         81.69%  0.12%   0.24%      77.07%     90.94%     83.43%
fun           73         26         73.74%  0.15%   0.30%      70.92%     95.63%     81.44%
heart         45         11         80.36%  0.09%   0.18%      76.09%     84.49%     80.07%
Note: normalized precision and frequency are on comparable
scales, allowing for the harmonic mean to take both into account.
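The table values can be checked by hand. The short sketch below recomputes precision from the positive and negative counts, and the Scaled F-Score as the harmonic mean of the two CDF columns; the variable names are illustrative, and the CDF values are copied from the table rather than recomputed.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers; 0 if both are 0."""
    return 2 * a * b / (a + b) if (a + b) else 0.0


# (term, positive count, negative count, precision CDF, frequency CDF)
rows = [
    ("best", 108, 36, 0.7195, 0.9950),
    ("entertaining", 58, 13, 0.7707, 0.9094),
    ("fun", 73, 26, 0.7092, 0.9563),
    ("heart", 45, 11, 0.7609, 0.8449),
]

results = {}
for term, pos, neg, prec_cdf, freq_cdf in rows:
    prec = pos / (pos + neg)
    scaled_f = harmonic_mean(prec_cdf, freq_cdf)
    results[term] = (round(prec, 4), round(scaled_f, 4))
    print(f"{term:<13} precision={prec:.2%} scaled F-score={scaled_f:.2%}")
```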
Problem: highly negative terms are all low frequency
(Axes: NormCDF-Precision vs. NormCDF-Frequency)
Solution:
• Compute Scaled F-Score association scores for negative reviews.
• Use the highest score
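One way to read "use the highest score" is sketched below. The sign convention (positive-class associations stay positive, negative-class associations are negated) is an assumption made for this illustration, and the per-term scores are hypothetical; Scattertext's own combination rule may differ in detail.

```python
def combined_scaled_f(pos_score, neg_score):
    """Keep whichever class-specific score is higher, signed by class.

    Assumption for this sketch: the result is positive when the term leans
    toward the positive class and negative when it leans toward the negative class.
    """
    return pos_score if pos_score >= neg_score else -neg_score


# Hypothetical per-term scores: (positive-class score, negative-class score).
term_scores = {"entertaining": (0.83, 0.10), "dull": (0.05, 0.78)}
combined = {t: combined_scaled_f(p, n) for t, (p, n) in term_scores.items()}
print(combined)
```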
Scaled-F-Score
Positive Scaled F-Score
Negative Scaled F-Score
Note: only one obviously negative term
Scaled-F-Score by log-frequency
(Axes: Scaled F-Score vs. Log Frequency)
The score can be overly sensitive to very frequent terms, but still doesn’t score
them very highly.
This chart over-emphasizes stop words and has a lot of white space.
Characteristic Term Detection
• General idea
• Characteristic terms are more likely to occur in the corpus than in general English
• “Normal” English:
• Peter Norvig’s list of word frequencies from the web in the late ’00s.
• Algorithm:
• Dense rank terms that appear in corpus by their corpus frequencies and their
“standard” English frequencies.
• Scale term ranks by the number of distinct ranks in the corpus (fewer) or
background (greater)
• Take rank difference
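The steps above can be sketched in a few lines. This is one reading of the algorithm, not the library's code: "scaling" is taken to mean dividing each dense rank by the number of distinct ranks, terms missing from the background list are given a count of 0, and all counts are invented.

```python
def dense_ranks(counts):
    """Dense-rank terms by count (ties share a rank), scaled into (0, 1]."""
    distinct = sorted(set(counts.values()))
    rank_of = {count: i + 1 for i, count in enumerate(distinct)}
    return {term: rank_of[count] / len(distinct) for term, count in counts.items()}


def characteristic_scores(corpus_counts, background_counts):
    """Scaled corpus rank minus scaled background rank for each corpus term."""
    corpus_rank = dense_ranks(corpus_counts)
    # Assumption: terms absent from the background list get a count of 0.
    background = {t: background_counts.get(t, 0) for t in corpus_counts}
    background_rank = dense_ranks(background)
    return {t: corpus_rank[t] - background_rank[t] for t in corpus_counts}


# Hypothetical counts for a movie-review corpus and a general-English list.
corpus_counts = {"the": 100, "film": 50, "movie": 40, "of": 40}
background_counts = {"the": 1000, "of": 500, "movie": 20, "film": 10}

scores = characteristic_scores(corpus_counts, background_counts)
most_characteristic = max(scores, key=scores.get)
print(most_characteristic, scores)
```

In this toy data "film" ranks high in the corpus but low in the background list, so it receives the largest rank difference, while "the" ranks at the top of both and scores 0.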
• Left: most frequent terms in corpus
• Positive rank difference between film and movie
Top 10 Characteristic Words by Rank
Y-axis: Scaled F-Score
X-axis: Characteristic Rank Delta
Some non-movie-like words do affect sentiment
Why not use TF-IDF?
• Drastically favors low-frequency terms
• Term in all classes -> idf = log(1) = 0 -> score = 0
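A minimal sketch of the second point, assuming the common idf = log(N / df) variant and treating each class as a single document (both are assumptions for illustration; the counts are made up). A term that appears in every class gets a score of 0 no matter how lopsided its counts are.

```python
import math


def tf_idf(term, class_counts):
    """TF-IDF of a term in each class, treating each class as one document."""
    n_classes = len(class_counts)
    df = sum(1 for counts in class_counts.values() if counts.get(term, 0) > 0)
    idf = math.log(n_classes / df) if df else 0.0
    return {cls: counts.get(term, 0) * idf for cls, counts in class_counts.items()}


# Hypothetical term counts per class.
class_counts = {
    "positive": {"good": 100, "sublime": 1},
    "negative": {"good": 20},
}

good = tf_idf("good", class_counts)
sublime = tf_idf("sublime", class_counts)
good_diff = good["positive"] - good["negative"]
sublime_diff = sublime["positive"] - sublime["negative"]
print(good_diff, round(sublime_diff, 3))
```

Here "good" is five times more common in positive reviews yet scores 0, while the single occurrence of "sublime" produces a nonzero difference.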
Chart: TF-IDF(Positive) - TF-IDF(Negative) vs. Log Frequency
Burt Monroe, Michael Colaresi and Kevin Quinn. Fightin'
words: Lexical feature selection and evaluation for
identifying the content of political conflict. Political
Analysis. 2008.
Monroe et al. (2008) approach
• Bayesian approach to term association
• Likelihood: Z-score of log-odds-ratio
• Prior: Term frequency in a background corpus
• Posterior: Z-score of log-odds-ratio with background counts as
smoothing values
Popular, but needs much more tweaking to work well than Scaled F-Score.
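A compact sketch of the z-scored log-odds-ratio with a Dirichlet prior, following the general form described by Monroe et al. It is not the Scattertext implementation (which adds prior-weighting modifications); the function name and all counts are hypothetical, and every term is assumed to have a positive prior count.

```python
import math


def log_odds_z_scores(counts_a, counts_b, prior):
    """Z-score of the log-odds-ratio between classes A and B, smoothed by a prior.

    Assumes `prior` holds a positive background count for every term scored.
    """
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    alpha_0 = sum(prior.values())
    z_scores = {}
    for term, alpha in prior.items():
        y_a = counts_a.get(term, 0) + alpha  # smoothed count in class A
        y_b = counts_b.get(term, 0) + alpha  # smoothed count in class B
        log_odds_a = math.log(y_a / (n_a + alpha_0 - y_a))
        log_odds_b = math.log(y_b / (n_b + alpha_0 - y_b))
        variance = 1.0 / y_a + 1.0 / y_b     # approximate variance of the difference
        z_scores[term] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return z_scores


# Hypothetical counts; the prior stands in for background-corpus frequencies.
positive = {"great": 30, "awful": 10, "the": 100}
negative = {"great": 10, "awful": 30, "the": 100}
background_prior = {"great": 5, "awful": 5, "the": 50}

z = log_odds_z_scores(positive, negative, background_prior)
print({term: round(score, 2) for term, score in z.items()})
```

A term used equally in both classes ("the") gets a z-score of 0, while "great" and "awful" get scores of equal size and opposite sign.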
Scattertext reimplementation of Monroe et al. See
http://nbviewer.jupyter.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Class-Association-Scores.ipynb for code.
Scattertext
implementation (with prior
weighting modifications)
In defense of stop words
Cindy K. Chung and James W. Pennebaker. Counting
Little Words in Big Data: The Psychology of
Communities, Culture, and History. EASP. 2012
In times of shared crisis, “we” use increases, while “I” use decreases.
I/we: age, social integration
I: lying, social rank
Function words and gender
Newman, ML; Groom, CJ; Handelman LD, Pennebaker, JW. Gender
Differences in Language Use: An Analysis of 14,000 Text Samples. 2008.
Effect size is Cohen’s d (>0 F, <0 M); MANOVA p<.001.
Bold in the original slide: entirely stop words.

LIWC Dimension                       Effect Size
All Pronouns (esp. 3rd person)       0.36
Present tense verbs (walk, is, be)   0.18
Feeling (touch, hold, feel)          0.17
Certainty (always, never)            0.14
Word count                           NS
Numbers                              -0.15
Prepositions                         -0.17
Words >6 letters                     -0.24
Swear words                          -0.22
Articles                             -0.24

• Performed on a variety of language categories, including speech.
• Other studies have found that function words are the best predictors of gender.
Clickbait: what works?
Clickbait corpus
• Facebook posts from BuzzFeed, NY Times, etc. from the 2010s.
• Includes headline and the number of Facebook likes
• Scraped by researcher Max Woolf at github.com/minimaxir/clickbait-cluster.
• We’ll separate articles from 2016 into the upper third and lower third
of likes.
• Identify words and phrases that predict likes.
• Begin with noun phrases identified from Phrase Machine (Handler et
al. 2016)
• Filter out redundant NPs.
Abram Handler, Matt Denny, Hanna Wallach, and Brendan O'Connor. Bag of what? Simple noun
phrase extraction for corpus analysis. NLP+CSS Workshop at EMNLP 2016.
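The upper-third / lower-third split described above might look like the sketch below. The record layout (`headline`, `likes`) and the example posts are hypothetical, not the schema of the scraped dataset.

```python
def split_by_like_terciles(posts):
    """Return (lower_third, upper_third) of posts ranked by like count."""
    ranked = sorted(posts, key=lambda post: post["likes"])
    k = len(ranked) // 3
    # The middle third is dropped so the two classes are clearly separated.
    return ranked[:k], ranked[len(ranked) - k:]


# Hypothetical posts for illustration only.
posts = [
    {"headline": "A", "likes": 5},
    {"headline": "B", "likes": 900},
    {"headline": "C", "likes": 40},
    {"headline": "D", "likes": 12000},
    {"headline": "E", "likes": 300},
    {"headline": "F", "likes": 7},
]

low, high = split_by_like_terciles(posts)
print([p["headline"] for p in low], [p["headline"] for p in high])
```

The two groups can then serve as the "low engagement" and "high engagement" classes when scoring terms.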
Scaled-F-Score of engagement by noun phrase
Scaled-F-Score of engagement by unigram
Psycholinguistic information:
• 3rd person pronouns -> high engagement (indicative of female)
• 2nd person -> low engagement (male)
• “dies”: obituaries
• “can”, “guess”, “how”: questions
Clickbait corpus
• How do terms with similar meanings differ in terms of their
engagement rates?
• Use Gensim (https://radimrehurek.com/gensim/) to find word embeddings
• Use UMAP (McInnes and Healy 2018) to project them into two
dimensions, and explore them with Scattertext.
• Locally groups words with similar embeddings together.
• Better alternative to t-SNE; allows for cosine instead of Euclidean distance
as the criterion
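To see why the choice of distance matters for embeddings, the sketch below compares the two measures on a pair of made-up vectors that point in the same direction but differ in length. The vectors and function names are illustrative and are not taken from the talk's Gensim or UMAP code.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine similarity; ignores vector length, compares direction only."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical 3-d "embeddings": same direction, different magnitudes.
cake = [1.0, 2.0, 0.5]
chocolate = [2.0, 4.0, 1.0]

print(round(euclidean_distance(cake, chocolate), 3))
print(round(abs(cosine_distance(cake, chocolate)), 3))
```

Under Euclidean distance the two vectors look far apart; under cosine distance they are essentially identical, which is often the more useful notion of similarity for word embeddings.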
Leland McInnes, John Healy. UMAP: Uniform Manifold Approximation and Projection for
Dimension Reduction. arXiv. 2018.
This island is mostly food related.
“Chocolate” and “cake” are highly engaging, but “breakfast” is
predictive of low engagement.
Term positions are determined by UMAP;
color by Scaled F-Score for engagement.
Clickbait corpus
• How do the Times and Buzzfeed differ in what they talk about, and
in how their content engages their readers?
• Scattertext can easily create visualizations to help answer these
questions.
• First, we’ll look at how what engages for Buzzfeed contrasts with what
engages for the Times, and vice versa
Oddly, NY Times readers distinctly like articles about sex and death, and
articles written in a smug tone.
This chart doesn’t give a good sense of what language is more associated
with one site.
This chart lets you know how Buzzfeed and the Times are distinct, while
still distinguishing engaging content.
Thank you! Questions?
Jason S. Kessler
Global AI Conference
April 27, 2018
https://github.com/JasonKessler/GlobalAI2018

Visualizing Words and Topics with Scattertext

  • 1. Visualizing Words and Topics with Scattertext Jason S. Kessler* June 14, 2018 Code for all visualizations is available at: https://github.com/JasonKessler/PyDataSeattle2018 $ pip3 install scattertext @jasonkessler*No, not that Jason Kessler
  • 2. Lexicon speculation Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. EMNLP. 2002. (ACL 2018 Test of Time Award Winner) @jasonkessler
  • 3. Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. EMNLP. 2002. Lexicon mining ≈ lexicon speculation @jasonkessler
  • 4. One is made from positive reviews, the other negative reviews (Mueller’s wordcloud) @jasonkessler
  • 5. Motivation • What language can be used to better market a product? • Words characteristic of effective (vs ineffective) marketing messages • Uncover framing of political issues • How do Republicans and Democratic politicians talk differently about abortion? • Cultural anthropology • What topics or language are characteristic to groups of people • Psycholinguistics • How language use is associated personality and other personal characteritics • Writing better headlines
  • 6. Language and Demographics Christian Rudder: http://blog.okcupid.com/index.php/page/7/ hobos almond butter 100 Years of Solitude Bikram yoga
  • 7. Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010) OKCupid: Words and phrases that distinguish white men.
  • 8. Explanation OKCupid: Words and phrases that distinguish Latin men. Source: http://blog.okcupid.com/index.php/page/7/ (Rudder 2010)
  • 9. Ranking with everyone else: the smaller the distance from the top left, the higher the association with white men. Phish is highly associated with white men; Kpop is not. Source: Christian Rudder. Dataclysm. 2014.
  • 10. my blue eyes Source: Christian Rudder. Dataclysm. 2014.
  • 11. Scattertext pip install scattertext github.com/JasonKessler/scattertext Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. 2017. - Interactive, d3-based scatterplot - Concise Python API - Automatically displays non-overlapping labels
  • 12. Scaled F-Score • Term-class* associations: • “Good” is associated with the “positive” class • “Bad” with the “negative” class • Core intuition: association relies on two necessary factors • Frequency: How often a term occurs in a class • Precision: P(class|document contains term) • F-Score: • Information retrieval evaluation metric • Harmonic mean between precision and recall • Requires both metrics to be high • *Term: defined to be a word, phrase or other discrete linguistic element
  • 14. Naïve approach Y-axis: - Precision, i.e. P(class|term), roughly normal distribution - Mean ≅ 0.5, sd ≅ 0.4 X-axis: - Frequency, i.e. P(term|class), roughly power-law distribution - Mean ≅ 0.00008, sd ≅ 0.008 Color: - Harmonic mean of precision and frequency (blue=high, red=low)
  • 15. Problem: • Top words are just stop words. • Why? • Harmonic mean of uneven distributions. • Most words have a precision of ~0.5, which leads the harmonic mean to rely on frequency.
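The problem on this slide can be illustrated with a small sketch. The two terms and their precision and frequency values below are hypothetical, chosen only to mimic the distributions described on slide 14; they are not taken from the talk's data.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative numbers (0 if either is 0)."""
    if a <= 0 or b <= 0:
        return 0.0
    return 2 * a * b / (a + b)


# Hypothetical terms: (precision P(class|term), frequency P(term|class)).
terms = {
    "the": (0.50, 0.05),       # stop word: uninformative but very frequent
    "superb": (0.90, 0.0002),  # sentiment word: informative but rare
}

# Naive association score: harmonic mean of the raw values.
naive_scores = {t: harmonic_mean(p, f) for t, (p, f) in terms.items()}
top_term = max(naive_scores, key=naive_scores.get)
print(top_term, naive_scores)
```

Because the harmonic mean of two values can never exceed twice the smaller one, the tiny raw frequencies cap every score, and the stop word ranks first despite its uninformative precision.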
  • 16. Fix: Normalize Precision and Frequency • Task: make precision and frequency similarly distributed • How: take normal CDF of each term’s precision and frequency • Mean and std. computed from data • Right: log normal CDF This area is the log-normal CDF of the term “beauty” (0.938 ∈ [0,1]). Each tick mark is the log-frequency of a term. *log-normal CDF isn’t used in these charts
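A minimal sketch of the normalization step, assuming the mean and standard deviation are the population statistics of the values being scaled. The helper name `norm_cdf_scale` is made up for this illustration and is not part of Scattertext's API.

```python
from statistics import NormalDist, mean, pstdev


def norm_cdf_scale(values):
    """Map each value to the CDF of a normal distribution fitted to `values`.

    The result lies in [0, 1] and preserves the ordering of the inputs.
    Raises statistics.StatisticsError if all values are identical (sigma = 0).
    """
    dist = NormalDist(mu=mean(values), sigma=pstdev(values))
    return [dist.cdf(v) for v in values]


# Hypothetical raw values on very different scales.
precisions = [0.10, 0.48, 0.50, 0.52, 0.90]
frequencies = [0.0001, 0.0002, 0.03, 0.04, 0.05]

scaled_precisions = norm_cdf_scale(precisions)
scaled_frequencies = norm_cdf_scale(frequencies)
print(scaled_precisions)
print(scaled_frequencies)
```

After scaling, both lists occupy the same [0, 1] range, so neither quantity dominates a harmonic mean purely because of its units.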
  • 18. Positive Scaled-F-Score Good: positive terms make sense! Still some function words, but that’s okay. Note: these frequent terms are all very close to 1 on the x-axis, but are ordered. Axes: NormCDF-Precision, NormCDF-Frequency
  • 19. Top Scaled-F Score Terms
       Term          Pos Freq.  Neg Freq.  Prec.   Freq %  Raw Hmean  Prec. CDF  Freq. CDF  Scaled F-Score
       best          108        36         75.00%  0.22%   0.44%      71.95%     99.50%     83.51%
       entertaining  58         13         81.69%  0.12%   0.24%      77.07%     90.94%     83.43%
       fun           73         26         73.74%  0.15%   0.30%      70.92%     95.63%     81.44%
       heart         45         11         80.36%  0.09%   0.18%      76.09%     84.49%     80.07%
       Note: normalized precision and frequency are on comparable scales, allowing for the harmonic mean to take both into account.
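As a check on the arithmetic, the row for “best” can be recomputed from the values printed in the table. This sketch uses only the rounded table values, so it reproduces the last two columns to the displayed precision rather than the underlying data.

```python
def harmonic_mean(a: float, b: float) -> float:
    return 2 * a * b / (a + b)


# Row for "best", copied from the table and expressed as fractions.
pos_freq, neg_freq = 108, 36
frequency = 0.0022                    # Freq %
prec_cdf, freq_cdf = 0.7195, 0.9950   # Prec. CDF, Freq. CDF

precision = pos_freq / (pos_freq + neg_freq)      # 108 / 144
raw_hmean = harmonic_mean(precision, frequency)   # dominated by frequency
scaled_f = harmonic_mean(prec_cdf, freq_cdf)      # both terms contribute

print(f"{precision:.2%} {raw_hmean:.2%} {scaled_f:.2%}")
# prints: 75.00% 0.44% 83.51%
```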
  • 20. Problem: highly negative terms are all low frequency Solution: • Compute Scaled F-Score association scores for negative reviews. • Use the highest score Axes: NormCDF-Precision, NormCDF-Frequency
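One way to read “use the highest score” is sketched below: compute a Scaled F-Score toward each class and keep whichever is larger, signing it by class. The signed [-1, 1] output, the function names, and all counts except those for “best” and “fun” (which come from slide 19) are assumptions for illustration; Scattertext's own implementation may combine the two scores differently.

```python
from statistics import NormalDist, mean, pstdev


def norm_cdf_scale(values):
    """Normal CDF of each value, with mean and std computed from the data."""
    dist = NormalDist(mu=mean(values), sigma=pstdev(values))
    return [dist.cdf(v) for v in values]


def scaled_f(cat_counts, other_counts):
    """Scaled F-Score of each term toward the class counted in cat_counts.

    Assumes every term occurs at least once across the two classes.
    """
    total = sum(cat_counts)
    precision = [c / (c + o) for c, o in zip(cat_counts, other_counts)]
    frequency = [c / total for c in cat_counts]
    p_cdf = norm_cdf_scale(precision)
    f_cdf = norm_cdf_scale(frequency)
    return [2 * p * f / (p + f) if p + f > 0 else 0.0
            for p, f in zip(p_cdf, f_cdf)]


def two_sided_scaled_f(pos_counts, neg_counts):
    """Positive score if the positive side wins, negated negative score otherwise."""
    pos = scaled_f(pos_counts, neg_counts)
    neg = scaled_f(neg_counts, pos_counts)
    return [p if p >= n else -n for p, n in zip(pos, neg)]


terms = ["best", "fun", "bad", "awful", "the"]
pos_counts = [108, 73, 20, 2, 500]   # "bad", "awful", "the" are hypothetical
neg_counts = [36, 26, 90, 15, 500]
scores = dict(zip(terms, two_sided_scaled_f(pos_counts, neg_counts)))
print(scores)
```

With this combination, a rare but strongly negative term such as “awful” receives a negative score instead of merely a low positive one.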
  • 21. Scaled-F-Score: Positive Scaled F-Score and Negative Scaled F-Score. Note: only one obviously negative term
  • 22. Scaled-F-Score by log-frequency The score can be overly sensitive to very frequent terms, but still doesn’t score them very highly. This chart over-emphasizes stop words, and has a lot of white space. Axes: Scaled F-Score, Log Frequency
  • 23. Characteristic Term Detection • General idea • Characteristic terms are more likely to occur in the corpus than in general English • “Normal” English: • Peter Norvig’s list of word frequencies from the web in the late ’00s. • Algorithm: • Dense rank terms that appear in the corpus by their corpus frequencies and their “standard” English frequencies. • Scale term ranks by the number of distinct ranks in the corpus (fewer) or background (greater) • Take rank difference
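The three algorithm steps above can be sketched as follows. All counts are hypothetical, the function names are made up for this illustration, and treating terms missing from the background list as having a count of zero is an assumption the slide does not state.

```python
def dense_rank(counts):
    """Dense rank: equal counts share a rank; ranks start at 1 and rise with count."""
    distinct = sorted(set(counts.values()))
    rank_of = {count: i + 1 for i, count in enumerate(distinct)}
    return {term: rank_of[count] for term, count in counts.items()}


def scaled_rank(counts):
    """Dense ranks divided by the number of distinct ranks, giving values in (0, 1]."""
    ranks = dense_rank(counts)
    n_distinct = max(ranks.values())
    return {term: rank / n_distinct for term, rank in ranks.items()}


def characteristic_scores(corpus_counts, background_counts):
    """Scaled corpus rank minus scaled background rank, for terms in the corpus."""
    # Assumption: a term absent from the background list gets a count of 0.
    background = {t: background_counts.get(t, 0) for t in corpus_counts}
    corpus_rank = scaled_rank(corpus_counts)
    background_rank = scaled_rank(background)
    return {t: corpus_rank[t] - background_rank[t] for t in corpus_counts}


# Hypothetical counts for a movie-review corpus and a general-English list.
corpus_counts = {"the": 5000, "of": 3000, "film": 900, "movie": 800}
background_counts = {"the": 23_000_000, "of": 13_000_000,
                     "film": 50_000, "movie": 60_000}

rank_delta = characteristic_scores(corpus_counts, background_counts)
print(rank_delta)
```

Terms with a positive difference rank higher in the corpus than in the background, which marks them as characteristic of the corpus; stop words that are common everywhere end up near zero.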
  • 24. • Left: most frequent terms in corpus • Positive rank difference between film and movie
  • 25. Top 10 Characteristic Words by Rank
  • 26. Y-axis: Scaled F-Score X-axis: Characteristic Rank Delta Some non-movie-like words do affect sentiment
  • 27. Why not use TF-IDF? • Drastically favors low frequency terms • term in all classes -> N/df = 1 -> idf = log(1) = 0 -> score = 0 Axes: TF-IDF(Positive) - TF-IDF(Negative), Log Frequency
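The zero-score behavior can be seen in a small sketch that treats each class as a single document and uses raw counts as term frequency with idf = log(N/df). Those choices, the function name, and every count except those for “best” are assumptions for illustration; TF-IDF has many variants.

```python
import math


def tfidf_difference(pos_counts, neg_counts):
    """TF-IDF(positive) - TF-IDF(negative), with each class as one document."""
    n_classes = 2
    scores = {}
    for term in set(pos_counts) | set(neg_counts):
        tf_pos = pos_counts.get(term, 0)
        tf_neg = neg_counts.get(term, 0)
        df = (tf_pos > 0) + (tf_neg > 0)   # number of classes containing the term
        idf = math.log(n_classes / df)     # log(1) = 0 when the term is in both
        scores[term] = (tf_pos - tf_neg) * idf
    return scores


pos_counts = {"best": 108, "luminous": 3}     # "luminous" is hypothetical
neg_counts = {"best": 36, "unwatchable": 2}   # "unwatchable" is hypothetical
tfidf_scores = tfidf_difference(pos_counts, neg_counts)
print(tfidf_scores)
```

“best” occurs three times as often in positive reviews yet scores exactly zero, while terms seen in only one class keep a nonzero score no matter how rare they are.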
  • 28. Monroe et al. (2008) approach • Bayesian approach to term-association • Likelihood: Z-score of log-odds-ratio • Prior: Term frequency in a background corpus • Posterior: Z-score of log-odds-ratio with background counts as smoothing values Popular, but needs much more tweaking to get to work than Scaled F-Score. Burt Monroe, Michael Colaresi and Kevin Quinn. Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis. 2008.
  • 29. Scattertext reimplementation of Monroe et al. See http://nbviewer.jupyter.org/github/JasonKessler/PuPPyTalk/blob/master/notebooks/Class-Association-Scores.ipynb for code. Scattertext implementation (with prior weighting modifications)
  • 30. In defense of stop words Cindy K. Chung and James W. Pennebaker. Counting Little Words in Big Data: The Psychology of Communities, Culture, and History. EASP. 2012 In times of shared crisis, “we” use increases, while “I” use decreases. I/we: age, social integration I: lying, social rank
  • 31. Function words and gender Newman, ML; Groom, CJ; Handelman LD, Pennebaker, JW. Gender Differences in Language Use: An Analysis of 14,000 Text Samples. 2008.
       LIWC Dimension (Bold: entirely stop words)   Effect Size (Cohen’s d) (>0 F, <0 M), MANOVA p<.001
       All Pronouns (esp. 3rd person)                0.36
       Present tense verbs (walk, is, be)            0.18
       Feeling (touch, hold, feel)                   0.17
       Certainty (always, never)                     0.14
       Word count                                    NS
       Numbers                                      -0.15
       Prepositions                                 -0.17
       Words >6 letters                             -0.24
       Swear words                                  -0.22
       Articles                                     -0.24
       • Performed on a variety of language categories, including speech. • Other studies have found that function words are the best predictors of gender.
  • 33. Clickbait corpus • Facebook posts from BuzzFeed, NY Times, etc. from the 2010s. • Includes headline and the number of Facebook likes • Scraped by researcher Max Woolf at github.com/minimaxir/clickbait-cluster. • We’ll separate articles from 2016 into the upper third and lower third of likes. • Identify words and phrases that predict likes. • Begin with noun phrases identified from Phrase Machine (Handler et al. 2016) • Filter out redundant NPs. Abram Handler, Matt Denny, Hanna Wallach, and Brendan O'Connor. Bag of what? Simple noun phrase extraction for corpus analysis. NLP+CSS Workshop at EMNLP 2016.
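The upper-third and lower-third split could look roughly like the sketch below. The headlines, like counts, and the `headline` and `likes` field names are invented for illustration and do not reflect the schema of the scraped data.

```python
def split_by_likes(posts):
    """Return (lower_third, upper_third) of posts ranked by like count.

    The middle third is discarded so that the two classes are clearly separated.
    """
    ranked = sorted(posts, key=lambda post: post["likes"])
    third = len(ranked) // 3
    lower = ranked[:third]
    upper = ranked[len(ranked) - third:]
    return lower, upper


# Hypothetical posts.
posts = [
    {"headline": "Can you guess the dessert?", "likes": 5400},
    {"headline": "Markets close lower", "likes": 120},
    {"headline": "How she built a garden", "likes": 3100},
    {"headline": "Council meets on Tuesday", "likes": 45},
    {"headline": "Chocolate cake in ten minutes", "likes": 8800},
    {"headline": "Quarterly report released", "likes": 300},
]

low_engagement, high_engagement = split_by_likes(posts)
print([p["headline"] for p in high_engagement])
```

The two resulting groups then serve as the classes for a term-association score such as Scaled F-Score.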
  • 35. Scaled-F-Score of engagement by unigram Psycholinguistic information: 3rd person pronouns -> high engagement (indicative of female) 2nd person low (male) “dies”: obit Can, guess, how: questions.
  • 36. Clickbait corpus • How do terms with similar meanings differ in terms of their engagement rates? • Use Gensim (https://radimrehurek.com/gensim/) to find word embeddings • Use UMAP (McInnes and Healy 2018) to project them into two dimensions, and explore them with Scattertext. • Locally groups words with similar embeddings together. • Better alternative to t-SNE; allows for cosine instead of Euclidean distance criteria Leland McInnes, John Healy. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv. 2018.
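To show why the choice between cosine and Euclidean distance matters for embeddings, here is a small standard-library sketch. It does not call Gensim or UMAP, and the two-dimensional vectors are toy values rather than real word embeddings.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity: depends only on the angle between the vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance: sensitive to vector length as well as direction."""
    return math.dist(u, v)


# Toy "embeddings": v points the same way as u but is ten times longer,
# while w is close to u in space but points in a different direction.
u, v, w = [1.0, 2.0], [10.0, 20.0], [2.0, 1.0]

print(cosine_distance(u, v), cosine_distance(u, w))
print(euclidean_distance(u, v), euclidean_distance(u, w))
```

Under cosine distance `u` is nearest to `v`, while under Euclidean distance it is nearest to `w`, so the two criteria can group the same words differently.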
  • 37. This island is mostly food related. “Chocolate” and “cake” are highly engaging, but “breakfast” is predictive of low engagement. Term positions are determined by UMAP, color by Scaled F-Score for engagement.
  • 38. Clickbait corpus • How do the Times and Buzzfeed differ in what they talk about, and in how their content engages their readers? • Scattertext can easily create visualizations to help answer these questions. • First, we’ll look at how what engages for Buzzfeed contrasts with what engages for the Times, and vice versa Leland McInnes, John Healy. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv. 2018.
  • 39. Oddly, NY Times readers distinctly like articles about sex and death, and articles written in a smug tone. This chart doesn’t give a good sense of what language is more associated with one site.
  • 40. This chart lets you know how Buzzfeed and the Times are distinct, while still distinguishing engaging content.
  • 41. Thank you! Questions? @jasonkessler Jason S. Kessler Global AI Conference April 27, 2018 https://github.com/JasonKessler/GlobalAI2018

Editor's Notes

  1. Selected terms. Which words and phrases statistically distinguish ethnic groups and genders?
  2. Selected terms
  3. Selected terms