Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis
Damian Trilling & Jeroen Jonkman
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Department of Communication Science
University of Amsterdam
WAPOR, Buenos Aires, 16–19 June 2015
Outline: Overview · Problems · Sample implementation: INFRA · Empirical example · Conclusions
Automated Framing analysis
Packing and Unpacking the Bag of Words Trilling & Jonkman
Deductive
• simple: word lists and search strings
• advanced: supervised machine learning
Inductive
• word frequencies and co-occurrences
• visualizations
• principal component analysis
• cluster analysis
• latent Dirichlet allocation
• ...
The inductive approaches are the focus of our study.
Methodological issues
What constitutes a frame, and how does this translate to an operationalization?
• Is a frame fundamentally different from a (sub-)topic? (⇒ topic modeling)
• Do we expect each element to occur in one and only one frame? (⇒ PCA)
• Do we need to distinguish between actors, actions, ..., or are all words taken into consideration equally?
• ...
Practical issues
• no standard software (but: more and more R packages and Python modules)
• reliance on inaccessible, self-written, or proprietary software
• lack of knowledge in the field
• size of the datasets
A catalogue of criteria
A toolkit for automated framing analysis should...
1. not depend on commercial software
2. run on all major operating systems
3. be scalable: usable on a laptop, but also on powerful servers to analyze millions of documents
4. be flexible and open: adaptable to one's own needs
5. have a powerful database engine in the background
Design
Sample implementation: INFRA
To meet these criteria, we wrote INFRA in Python, using the
NoSQL database MongoDB. The toolkit will be made freely
available, both as source code and via a web interface.
Workflow (flowchart)
Data management phase
• Data (e.g., LexisNexis articles) ⇒ import filter ⇒ NoSQL database ⇒ cleaning and preprocessing filters ⇒ cleaned NoSQL database
Analysis phase
• word frequencies and co-occurrences
• log likelihood
• visualizations
• define details for analysis (e.g., important actors)
• dictionary filter/named entity recognition
• latent Dirichlet allocation
• principal component analysis
• cluster analysis
Central storage
Data management phase handled on the server; analyses can be
handled either on the server (SSH) or locally (INFRA)
[Diagram: external data ⇒ MongoDB server ⇔ Computer1, Computer2, Computer3, Computer4]
Server: Linux-VM with MongoDB server; Clients: Python, INFRA, mongo client
Enjoying the advantages of the bag-of-words (BOW) approach and overcoming its shortcomings
In the preprocessing phase
• all information is still available
• we can use custom regexp-based rules and filters
e.g.: if a text contains [list of synonyms of A] and [list of synonyms of B], replace [synonyms of A] with C
• extremely useful for unifying actors that are referred to in several ways
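
A conditional rule of this kind might look roughly like the sketch below. It is only an illustration, not INFRA's actual code: the function name `conditional_replace`, the two patterns, the label `ing_bank`, and the example sentences are all hypothetical.

```python
import re

def conditional_replace(text, pattern_a, pattern_b, replacement):
    """Replace every match of pattern_a with replacement, but only if
    pattern_b also occurs somewhere in the same text."""
    if re.search(pattern_a, text) and re.search(pattern_b, text):
        return re.sub(pattern_a, replacement, text)
    return text

# Hypothetical rule: "the bank" is ambiguous, so it is only recoded to the
# actor label "ing_bank" when the text also mentions ING or Amsterdam.
SYNONYMS_A = re.compile(r"\bthe bank\b", re.IGNORECASE)
CONTEXT_B = re.compile(r"\b(?:ING|Amsterdam)\b")

recoded = conditional_replace(
    "The bank said on Monday that ING would cut costs.",
    SYNONYMS_A, CONTEXT_B, "ing_bank",
)
unchanged = conditional_replace(
    "The bank of the river flooded.", SYNONYMS_A, CONTEXT_B, "ing_bank"
)
```

Because the rule runs on the full text before it is reduced to a bag of words, the context needed for the condition is still present.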
In the analysis phase
• work with a much smaller, and therefore faster, dataset that contains only the necessary information
• no need to deal with misspellings and variations any more
Towards a “best practice” of inductive framing analysis
In the data management phase
• spend much time on re-coding relevant multi-word entities to avoid noise (of course, “Barack” and “Obama” occur together) and recode synonyms (how would you otherwise reliably estimate frequencies?)
⇒ especially important for questions like “how is actor X framed?”
• regular expressions instead of simple word lists!
• make an informed decision on how to harmonize the dataset (stopword removal, stemming (?), POS tagging (?))
And: share these procedures!
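
To illustrate why recoding matters for frequency estimates, here is a small sketch using only the Python standard library. The rules, the token `barack_obama`, and the example text are made up for this illustration and are not taken from the toolkit.

```python
import re
from collections import Counter

# Hypothetical recoding rules: (regular expression, unified token).
# Order matters: the full name is handled before the bare surname.
RULES = [
    (re.compile(r"\b(?:President\s+)?Barack\s+Obama\b"), "barack_obama"),
    (re.compile(r"\b(?:President\s+)?Obama\b"), "barack_obama"),
]

def recode(text):
    """Apply all recoding rules in order."""
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return text

def word_frequencies(text):
    """Lowercase the text and count word tokens (underscores are kept)."""
    return Counter(re.findall(r"\w+", text.lower()))

raw = "Barack Obama spoke. President Obama left. Obama returned."
before = word_frequencies(raw)          # "barack" and "obama" counted separately
after = word_frequencies(recode(raw))   # one token per mention of the actor
```

Without recoding, “barack” and “obama” show up as two separate, trivially co-occurring words. After recoding, all three mentions are counted under a single token, and a plain word list could not have distinguished the three spelling variants this way.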
In the analysis phase
• background knowledge necessary (face validity)
• robustness: do slightly different parameters deliver similar results?
• dataset too small ⇒ sensitivity to atypical events (scandals etc.) ⇒ discovering a topic rather than a frame
• difference between statistical predictive power and meaningfulness
Empirical example: Dutch business news
Steps
Preprocessing steps
1. Ingest and parse all possibly relevant articles (≈ 500 000)
2. Compose list of ≈ 1 500 regular expressions to substitute synonyms and combinations to correctly code actors, allowing for conditional substitutions
3. Remove stopwords, punctuation, etc.
4. Determine part-of-speech, keep only nouns, adjectives, adverbs
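
Step 3 could be sketched as follows. The stopword list is a tiny made-up sample, the function name `clean` is hypothetical, and part-of-speech filtering (step 4) is left out because it needs an external tagger.

```python
import re

# Tiny illustrative Dutch stopword list (an assumption); a real analysis
# would use a complete list for the language of the corpus.
STOPWORDS = {"de", "het", "een", "en", "van", "in", "op"}

def clean(text):
    """Lowercase the text, drop punctuation, and remove stopwords.
    Underscores are kept so that recoded actors such as 'ing_bank' survive."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

tokens = clean("De winst van Philips steeg in 2012, en het aandeel ook.")
```

Running this after the regular-expression substitutions, not before, keeps multi-word actors intact while they are being recoded.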
Analysis steps
1. Determine relevant actors with frequency counts, filtering out all non-Dutch words (alternative: named entity recognition)
2. Conduct PCA, cluster analysis, and LDA; additionally, count the frequency of actor mentions
3. Fine-tune, repeat, and choose the final model
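
The actor-mention count in step 2 can be illustrated with a few lines of standard-library Python. This sketch assumes documents that are already tokenized and recoded; the function name `actor_mentions` and the toy data are hypothetical.

```python
from collections import Counter

def actor_mentions(documents, actors):
    """For each actor, count the number of documents that mention it
    at least once (doc_freq) and its total number of mentions (total)."""
    doc_freq, total = Counter(), Counter()
    for tokens in documents:
        counts = Counter(t for t in tokens if t in actors)
        total.update(counts)            # adds the per-document counts
        doc_freq.update(counts.keys())  # adds 1 per actor present
    return doc_freq, total

docs = [
    ["philips", "winst", "philips", "shell"],
    ["shell", "olie"],
    ["ing_bank", "shell", "krediet"],
]
doc_freq, total = actor_mentions(docs, actors={"philips", "shell", "ing_bank"})
```

Reporting both numbers helps to tell an actor that is mentioned repeatedly in a few articles from one that appears across many.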
Output
Example: Attention over time
Overview of news attention: attention to 100 firms in company
news and entropy (red line) from 2007 to 2013.
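
The slide does not say which entropy measure is plotted. Assuming Shannon entropy over the share of mentions each firm receives in a given period, it could be computed as in this sketch; the function name and the firm counts are invented for illustration.

```python
import math

def attention_entropy(mention_counts):
    """Shannon entropy (in bits) of the attention distribution over firms.
    Low values: attention concentrated on few firms; high values: spread out."""
    total = sum(mention_counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in mention_counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

concentrated = attention_entropy(
    {"firm_a": 97, "firm_b": 1, "firm_c": 1, "firm_d": 1}
)
even = attention_entropy(
    {"firm_a": 25, "firm_b": 25, "firm_c": 25, "firm_d": 25}
)
```

With four firms, evenly spread attention gives the maximum of 2 bits, while attention focused on a single firm pushes the value towards 0.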
Example: Topics
Results of a topic model
Example: Components
Results of a principal component analysis
Example: Co-occurrences
Results of a network visualization of co-occurrences
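
The edge list behind such a network can be built with a document-level co-occurrence count. The sketch below is one possible way to do it, assuming tokenized documents; the function name `cooccurrence_edges`, the `min_weight` threshold, and the toy data are assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(documents, min_weight=1):
    """Count in how many documents each pair of tokens appears together
    and keep only pairs that reach min_weight."""
    pairs = Counter()
    for tokens in documents:
        # sorted(set(...)) counts each pair once per document and gives
        # every pair a canonical (alphabetical) order.
        pairs.update(combinations(sorted(set(tokens)), 2))
    return {pair: w for pair, w in pairs.items() if w >= min_weight}

docs = [
    ["shell", "olie", "winst"],
    ["shell", "olie"],
    ["philips", "winst"],
]
edges = cooccurrence_edges(docs, min_weight=2)
```

The resulting dictionary maps token pairs to edge weights and can be exported to any network visualization tool. Raising `min_weight` prunes weak ties so that the graph stays readable.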
Conclusions
• We developed a toolkit that integrates all recent methods used for automated inductive framing analysis
• It is free
• It works with large-scale datasets
• It can be used by a whole group together
Next steps
• Regarding the tool: graphical interface
• Regarding the method: systematic validation study; comparing different approaches and settings
Questions?
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net