Packing and Unpacking the Bag of Words:
Introducing a Toolkit for Inductive Automated
Frame Analysis
Damian Trilling & Jeroen Jonkman
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
WAPOR, Buenos Aires, 16–19 June 2015
Overview Problems Sample implementation: INFRA Empirical example Conclusions
Automated Framing analysis
Packing and Unpacking the Bag of Words Trilling & Jonkman
Deductive
• simple: word lists and search
strings
• advanced: supervised
machine learning
Inductive
• word frequencies and
co-occurrences
• visualizations
• principal component
analysis
• cluster analysis
• latent Dirichlet allocation
• . . .
The inductive approaches are the focus of our study
Methodological issues
What constitutes a frame?
— and how does this translate to an operationalization?
• Is a frame fundamentally different from a (sub-)topic? (⇒
topic modeling)
• Do we expect each element to occur in one and only one
frame? (⇒ PCA)
• Do we need to distinguish between actors, actions, . . . — or
are all words taken into consideration equally?
• . . .
Practical issues
• no standard software (but: more and more R-packages and
Python modules)
• reliance on inaccessible, self-written, or proprietary software
• lack of knowledge in the field
• size of the datasets
A catalogue of criteria
A toolkit for automated framing analysis should. . .
1 not depend on commercial software
2 run on all major operating systems
3 be scalable: usable on a laptop, but also on powerful servers
to analyze millions of documents.
4 be flexible and open: adaptable to one's own needs
5 have a powerful database engine in the background
Design
Sample implementation: INFRA
To meet these criteria, we wrote INFRA in Python, using the
NoSQL database MongoDB. The toolkit will be made freely
available, both as source code and via a web interface.
[Figure: INFRA workflow]
Data management phase: data (e.g., LexisNexis articles); import filter; NoSQL database; cleaning and preprocessing filters; cleaned NoSQL database
Analysis phase: word frequencies and co-occurrences; log likelihood; visualizations; define details for analysis (e.g., important actors); dictionary filter/named entity recognition; latent Dirichlet allocation; principal component analysis; cluster analysis
Central storage
Data management phase handled on the server; analyses can be
handled either on the server (SSH) or locally (INFRA)
[Diagram: external data, a MongoDB server, and four client machines (Computer1 to Computer4)]
Server: Linux-VM with MongoDB server; Clients: Python, INFRA, mongo client
Enjoying the advantages of BOW — and overcoming its
shortcomings
In the preprocessing phase
• all information is still available
• we can use custom regexp-based rules and filters
e.g.: if a text contains [list of synonyms of A] and [list of synonyms of B],
replace [synonyms of A] with C
• extremely useful for unifying actors that are referred to in several ways
In the analysis phase
• work with a dataset that is much faster to process because it contains
only the necessary information
• no need to deal with misspellings and variations any more
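The conditional rule described above (replace synonyms of A with C only if the text also mentions B) could be sketched roughly as follows. The synonym lists, the replacement token, and the function name are hypothetical and are not taken from INFRA.

```python
import re

# Hypothetical rule: the synonym lists and the replacement token are
# invented for illustration and are not taken from INFRA.
SYNONYMS_A = re.compile(r"\b(?:the bank|the lender)\b", re.IGNORECASE)
SYNONYMS_B = re.compile(r"\b(?:ING Groep|ING Group)\b", re.IGNORECASE)
REPLACEMENT_C = "ING_bank"


def conditional_substitute(text):
    """Replace synonyms of A with C, but only if the text also mentions B."""
    if SYNONYMS_A.search(text) and SYNONYMS_B.search(text):
        return SYNONYMS_A.sub(REPLACEMENT_C, text)
    return text


with_context = conditional_substitute("ING Groep said the bank will cut costs.")
without_context = conditional_substitute("The bank will cut costs.")
```

In this sketch the second text is left unchanged, because without a mention of B it is unclear which bank is meant.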
Towards a “best practice” of inductive framing analysis
In the data management phase
• spend much time on re-coding relevant multi-word entities to
avoid noise (of course, “Barack” and “Obama” occur
together) and recode synonyms (how would you otherwise
reliably estimate frequencies?)
⇒ especially important for questions like “how is actor X
framed?”
• regular expressions instead of simple word lists!
• make an informed decision on how to harmonize the dataset
(stopword removal, stemming (?), POS tagging (?))
And: share these procedures!
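As a minimal sketch of such recoding, the rules below collapse a multi-word entity and its variants into a single token before counting. The patterns and the token name are assumptions made for illustration; a real rule list would be far longer.

```python
import re
from collections import Counter

# Hypothetical recoding rules (pattern -> single token); invented for
# illustration only.
RECODING_RULES = [
    (re.compile(r"\b(?:President\s+)?Barack\s+Obama\b"), "barack_obama"),
    (re.compile(r"\b(?:President\s+)?Obama\b"), "barack_obama"),
]


def recode(text):
    """Apply every rule in order so each actor ends up as one token."""
    for pattern, token in RECODING_RULES:
        text = pattern.sub(token, text)
    return text


raw = "Barack Obama spoke today. President Obama later left. Obama returned."
counts = Counter(recode(raw).replace(".", "").split())
```

After recoding, all three mentions are counted under one token, and "Barack" and "Obama" no longer show up as two separate, trivially co-occurring words.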
In the analysis phase
• background knowledge necessary (face validity)
• robustness: do slightly different parameters deliver similar
results?
• a dataset that is too small ⇒ sensitivity to atypical events (scandals
etc.) ⇒ discovering a topic rather than a frame
• difference between statistical predictive power and
meaningfulness
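One possible way to approach the robustness question is to compare the top words of two runs with slightly different settings. The sketch below uses Jaccard overlap, which is our assumption here and not necessarily the check used in INFRA; the word lists are invented.

```python
def jaccard(a, b):
    """Share of items that two collections have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match_overlap(run_a, run_b):
    """For each word cluster in run_a, the highest Jaccard overlap in run_b."""
    return [max(jaccard(cluster, other) for other in run_b) for cluster in run_a]


# Hypothetical top words from two runs with slightly different settings.
run_k10 = [["bank", "loan", "crisis", "debt"], ["merger", "deal", "shares", "bid"]]
run_k12 = [["merger", "deal", "shares", "offer"], ["bank", "loan", "crisis", "debt"]]

overlaps = best_match_overlap(run_k10, run_k12)
```

High overlaps for most clusters suggest that the solution does not hinge on one particular parameter choice; whether the clusters are meaningful still requires background knowledge.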
Empirical example:
Dutch business news
Steps
Preprocessing steps
1 Ingest and parse all possibly relevant articles (≈ 500 000)
2 Compose list of ≈ 1 500 regular expressions to substitute
synonyms and combinations to correctly code actors, allowing
for conditional substitutions
3 Remove stopwords, punctuation, etc.
4 Determine part-of-speech, keep only nouns, adjectives, adverbs
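Steps 3 and 4 could look roughly like the sketch below. It assumes that tokens have already been tagged by an external part-of-speech tagger; the stopword list, the tag names, and the function name are hypothetical.

```python
import string

# Hypothetical stopword list and tag set; a real analysis would use a full
# Dutch stopword list and the tag names of whichever POS tagger is chosen.
STOPWORDS = {"de", "het", "een", "en", "van", "is"}
KEEP_TAGS = {"NOUN", "ADJ", "ADV"}


def clean_tokens(tagged_tokens):
    """Drop punctuation and stopwords, keep nouns, adjectives, and adverbs."""
    kept = []
    for word, tag in tagged_tokens:
        word = word.lower().strip(string.punctuation)
        if not word or word in STOPWORDS:
            continue
        if tag in KEEP_TAGS:
            kept.append(word)
    return kept


# Tokens are assumed to have been tagged already by an external POS tagger.
tagged = [("De", "DET"), ("grote", "ADJ"), ("bank", "NOUN"), ("groeit", "VERB"),
          ("snel", "ADV"), (".", "PUNCT")]
cleaned = clean_tokens(tagged)
```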
Analysis steps
1 Determine relevant actors with frequency counts, filtering out
all non-Dutch words (alternative: named entity recognition)
2 Conduct PCA, cluster analysis, and LDA – additionally, count
frequency of actor mentions
3 Fine-tune, repeat, and choose the final model
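Counting actor mentions and co-occurrences can be sketched with the standard library alone. The actor tokens and documents below are invented, and the actors are assumed to have been unified into single tokens during preprocessing.

```python
from collections import Counter
from itertools import combinations

# Hypothetical actor tokens, assumed to have been unified during preprocessing.
ACTORS = {"shell", "philips", "ing_bank"}

documents = [
    ["shell", "winst", "philips", "shell"],
    ["ing_bank", "shell", "krediet"],
    ["philips", "omzet"],
]

mention_counts = Counter()        # total mentions per actor
cooccurrence_counts = Counter()   # number of documents mentioning both actors

for tokens in documents:
    mention_counts.update(t for t in tokens if t in ACTORS)
    present = sorted(set(tokens) & ACTORS)
    cooccurrence_counts.update(combinations(present, 2))
```

Sorting the actors before pairing them keeps each pair under one key, so ("philips", "shell") and ("shell", "philips") are not counted separately.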
Output
Example: Attention over time
Overview of news attention: attention to 100 firms in company
news and entropy (red line) from 2007 to 2013.
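The slide does not say how entropy is computed. Assuming Shannon entropy over the distribution of mentions across firms, a minimal sketch with invented counts could look like this:

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probabilities = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probabilities)


# Hypothetical mention counts for four firms in two periods.
concentrated = shannon_entropy([97, 1, 1, 1])
even = shannon_entropy([25, 25, 25, 25])
```

Under this assumption, low entropy means that attention is concentrated on a few firms, and high entropy means that it is spread evenly.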
Example: Topics
Results of a topic model
Example: Components
Results of a principal component analysis
Example: co-occurrences
Results of a network visualization of co-occurrences
Conclusions
• We developed a toolkit that integrates all recent methods
used for automated inductive framing analysis
• It is free
• It works with large-scale datasets
• It can be used by a whole group together
Next steps
• Regarding the tool: graphical interface
• Regarding the method: systematic validation study; comparing
different approaches and settings
Questions?
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Packing and Unpacking the Bag of Words Trilling & Jonkman

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Empirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overview
 
NLP Project Presentation
NLP Project PresentationNLP Project Presentation
NLP Project Presentation
 
Nlp presentation
Nlp presentationNlp presentation
Nlp presentation
 
Knowledge Patterns SSSW2016
Knowledge Patterns SSSW2016Knowledge Patterns SSSW2016
Knowledge Patterns SSSW2016
 
Surveys in Software Engineering
Surveys in Software EngineeringSurveys in Software Engineering
Surveys in Software Engineering
 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical
 
VOC real world enterprise needs
VOC real world enterprise needsVOC real world enterprise needs
VOC real world enterprise needs
 
Blenderbot
BlenderbotBlenderbot
Blenderbot
 
A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment AnalysisA Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis
 
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
 
Plug play language_models
Plug play language_modelsPlug play language_models
Plug play language_models
 
The State of #NLProc
The State of #NLProcThe State of #NLProc
The State of #NLProc
 
Systematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping StudiesSystematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping Studies
 
2.18 tổ chức lớp viết báo khoa học kỹ thuật đăng trên tạp chí quốc tế (13)
2.18 tổ chức lớp viết báo khoa học kỹ thuật đăng trên tạp chí quốc tế (13)2.18 tổ chức lớp viết báo khoa học kỹ thuật đăng trên tạp chí quốc tế (13)
2.18 tổ chức lớp viết báo khoa học kỹ thuật đăng trên tạp chí quốc tế (13)
 
Case Study Research in Software Engineering
Case Study Research in Software EngineeringCase Study Research in Software Engineering
Case Study Research in Software Engineering
 
Design Thinking for Requirements Engineering
Design Thinking for Requirements EngineeringDesign Thinking for Requirements Engineering
Design Thinking for Requirements Engineering
 
Machine Learning in NLP
Machine Learning in NLPMachine Learning in NLP
Machine Learning in NLP
 
[Paper Reading] Supervised Learning of Universal Sentence Representations fro...
[Paper Reading] Supervised Learning of Universal Sentence Representations fro...[Paper Reading] Supervised Learning of Universal Sentence Representations fro...
[Paper Reading] Supervised Learning of Universal Sentence Representations fro...
 
010821+presentation+oti.ppt
010821+presentation+oti.ppt010821+presentation+oti.ppt
010821+presentation+oti.ppt
 
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
 

Semelhante a Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

Gadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLGadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALL
Lawrie Hunter
 
Data management for TA's
Data management for TA'sData management for TA's
Data management for TA's
aaroncollie
 
Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemT
Laura Chiticariu
 
Name ID Number Section 1 SummaryAt least 250 words as counted.docx
Name ID Number Section 1 SummaryAt least 250 words as counted.docxName ID Number Section 1 SummaryAt least 250 words as counted.docx
Name ID Number Section 1 SummaryAt least 250 words as counted.docx
roushhsiu
 

Semelhante a Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis (20)

C++ plus data structures, 3rd edition (2003)
C++ plus data structures, 3rd edition (2003)C++ plus data structures, 3rd edition (2003)
C++ plus data structures, 3rd edition (2003)
 
Gadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLGadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALL
 
Exposé Ontology
Exposé OntologyExposé Ontology
Exposé Ontology
 
Data Structures and Abstractions with Java 5th Edition Carrano Solutions Manual
Data Structures and Abstractions with Java 5th Edition Carrano Solutions ManualData Structures and Abstractions with Java 5th Edition Carrano Solutions Manual
Data Structures and Abstractions with Java 5th Edition Carrano Solutions Manual
 
Final presentation
Final presentationFinal presentation
Final presentation
 
Text Analytics for Legal work
Text Analytics for Legal workText Analytics for Legal work
Text Analytics for Legal work
 
Text Analytics in Enterprise Search - Daniel Ling
Text Analytics in Enterprise Search - Daniel LingText Analytics in Enterprise Search - Daniel Ling
Text Analytics in Enterprise Search - Daniel Ling
 
Text Analytics in Enterprise Search
Text Analytics in Enterprise SearchText Analytics in Enterprise Search
Text Analytics in Enterprise Search
 
Master Beginners Workshop - Feb 2023
Master Beginners Workshop - Feb 2023Master Beginners Workshop - Feb 2023
Master Beginners Workshop - Feb 2023
 
DITA Surprise, Unwrapping DITA Best Practices - tekom tcworld 2016
DITA Surprise, Unwrapping DITA Best Practices - tekom tcworld 2016DITA Surprise, Unwrapping DITA Best Practices - tekom tcworld 2016
DITA Surprise, Unwrapping DITA Best Practices - tekom tcworld 2016
 
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic MinerAutomatic Classification of Springer Nature Proceedings with Smart Topic Miner
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
 
Natural Language Processing using Java
Natural Language Processing using JavaNatural Language Processing using Java
Natural Language Processing using Java
 
Debugging machine-learning
Debugging machine-learningDebugging machine-learning
Debugging machine-learning
 
Data management for TA's
Data management for TA'sData management for TA's
Data management for TA's
 
RuleML2015 - Tutorial - Powerful Practical Semantic Rules in Rulelog - Funda...
RuleML2015 - Tutorial -  Powerful Practical Semantic Rules in Rulelog - Funda...RuleML2015 - Tutorial -  Powerful Practical Semantic Rules in Rulelog - Funda...
RuleML2015 - Tutorial - Powerful Practical Semantic Rules in Rulelog - Funda...
 
Declarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemTDeclarative Multilingual Information Extraction with SystemT
Declarative Multilingual Information Extraction with SystemT
 
qualitative.ppt
qualitative.pptqualitative.ppt
qualitative.ppt
 
Using Computer as a Research Assistant in Qualitative Research
Using Computer as a Research Assistant in Qualitative ResearchUsing Computer as a Research Assistant in Qualitative Research
Using Computer as a Research Assistant in Qualitative Research
 
Presentation
PresentationPresentation
Presentation
 
Name ID Number Section 1 SummaryAt least 250 words as counted.docx
Name ID Number Section 1 SummaryAt least 250 words as counted.docxName ID Number Section 1 SummaryAt least 250 words as counted.docx
Name ID Number Section 1 SummaryAt least 250 words as counted.docx
 

Mais de Department of Communication Science, University of Amsterdam

Mais de Department of Communication Science, University of Amsterdam (20)

BDACA - Lecture8
BDACA - Lecture8BDACA - Lecture8
BDACA - Lecture8
 
BDACA - Lecture7
BDACA - Lecture7BDACA - Lecture7
BDACA - Lecture7
 
BDACA - Lecture6
BDACA - Lecture6BDACA - Lecture6
BDACA - Lecture6
 
BDACA - Tutorial5
BDACA - Tutorial5BDACA - Tutorial5
BDACA - Tutorial5
 
BDACA - Lecture5
BDACA - Lecture5BDACA - Lecture5
BDACA - Lecture5
 
BDACA - Lecture4
BDACA - Lecture4BDACA - Lecture4
BDACA - Lecture4
 
BDACA - Lecture3
BDACA - Lecture3BDACA - Lecture3
BDACA - Lecture3
 
BDACA - Lecture2
BDACA - Lecture2BDACA - Lecture2
BDACA - Lecture2
 
BDACA - Tutorial1
BDACA - Tutorial1BDACA - Tutorial1
BDACA - Tutorial1
 
BDACA - Lecture1
BDACA - Lecture1BDACA - Lecture1
BDACA - Lecture1
 
BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7BDACA1617s2 - Lecture7
BDACA1617s2 - Lecture7
 
BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6BDACA1617s2 - Lecture6
BDACA1617s2 - Lecture6
 
BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5BDACA1617s2 - Lecture5
BDACA1617s2 - Lecture5
 
BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4BDACA1617s2 - Lecture4
BDACA1617s2 - Lecture4
 
BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3BDACA1617s2 - Lecture3
BDACA1617s2 - Lecture3
 
BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2BDACA1617s2 - Lecture 2
BDACA1617s2 - Lecture 2
 
BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1BDACA1617s2 - Tutorial 1
BDACA1617s2 - Tutorial 1
 
BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1BDACA1617s2 - Lecture 1
BDACA1617s2 - Lecture 1
 
Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...Media diets in an age of apps and social media: Dealing with a third layer of...
Media diets in an age of apps and social media: Dealing with a third layer of...
 
Conceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news itemsConceptualizing and measuring news exposure as network of users and news items
Conceptualizing and measuring news exposure as network of users and news items
 

Último

Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
KarakKing
 

Último (20)

Single or Multiple melodic lines structure
Single or Multiple melodic lines structureSingle or Multiple melodic lines structure
Single or Multiple melodic lines structure
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Plant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptxPlant propagation: Sexual and Asexual propapagation.pptx
Plant propagation: Sexual and Asexual propapagation.pptx
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Salient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functionsSalient Features of India constitution especially power and functions
Salient Features of India constitution especially power and functions
 
ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.ICT role in 21st century education and it's challenges.
ICT role in 21st century education and it's challenges.
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Graduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - EnglishGraduate Outcomes Presentation Slides - English
Graduate Outcomes Presentation Slides - English
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 

Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis

  • 1. Packing and Unpacking the Bag of Words: Introducing a Toolkit for Inductive Automated Frame Analysis Damian Trilling & Jeroen Jonkman d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam WAPOR, Buenos Aires, 16–19 June 2015
  • 2. Overview Problems Sample implementation: INFRA Empirical example Conclusions Automated Framing analysis Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 3. Overview Problems Sample implementation: INFRA Empirical example Conclusions Automated Framing analysis Deductive • simple: word lists and search strings • advanced: supervised machine learning Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 4. Overview Problems Sample implementation: INFRA Empirical example Conclusions Automated Framing analysis Deductive • simple: word lists and search strings • advanced: supervised machine learning Inductive • word frequencies and co-occurrences • visualizations • principal component analysis • cluster analysis • latent dirichlet allocation • . . . Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 5. Overview Problems Sample implementation: INFRA Empirical example Conclusions Automated Framing analysis Inductive • word frequencies and co-occurrences • visualizations • principal component analysis • cluster analysis • latent dirichlet allocation • . . . This is the focus of our study Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 6. Overview Problems Sample implementation: INFRA Empirical example Conclusions Methodological issues Methodological issues What constitutes a frame? — and how does this translate to an operationalization? Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 7. Overview Problems Sample implementation: INFRA Empirical example Conclusions Methodological issues Methodological issues What constitutes a frame? — and how does this translate to an operationalization? • Is a frame fundamentally different from a (sub-)topic? (⇒ topic modeling) Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 8. Overview Problems Sample implementation: INFRA Empirical example Conclusions Methodological issues Methodological issues What constitutes a frame? — and how does this translate to an operationalization? • Is a frame fundamentally different from a (sub-)topic? (⇒ topic modeling) • Do we expect each element to occur in one and only one frame? (⇒ PCA) Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 9. Overview Problems Sample implementation: INFRA Empirical example Conclusions Methodological issues Methodological issues What constitutes a frame? — and how does this translate to an operationalization? • Is a frame fundamentally different from a (sub-)topic? (⇒ topic modeling) • Do we expect each element to occur in one and only one frame? (⇒ PCA) • Do we need to distinguish between actors, actions, . . . — or are all words taken into consideration equally? • . . . Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 10. Overview Problems Sample implementation: INFRA Empirical example Conclusions Practical issues Practical issues • no standard software (but: more and more R-packages and Python modules) • reliance on inaccessible, self-written, or proprietary software • lack of knowledge in the field • size of the datasets Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 11. Overview Problems Sample implementation: INFRA Empirical example Conclusions A catalogue of criteria A catalogue of criteria A toolkit for automated framing analysis should. . . 1 not depend on commercial software 2 run on all major operating systems 3 be scalable: usable on a laptop, but also on powerful servers to analyze millions of documents. 4 be flexible and open: adoptable to own needs 5 have a powerful database engine on the background Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 12. Overview Problems Sample implementation: INFRA Empirical example Conclusions Design Sample implementation: INFRA To meet these criteria, we wrote INFRA in Python, using the NoSQL database MongoDB. The toolkit will be made freely available, both as source code and via a web interface. Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 13. Data (e.g., Lexis Nexis articles) Import filter NoSQL database Cleaning and pre- processing filters Cleaned NoSQL database word frequencies and co-occurences log likelihood visualizations define details for analysis (e.g., im- portant actors) dictionary filter/named entity recognition Latent dirich- let allocation Principal com- ponent analysis Cluster analysis Data management phase Analysis phase
  • 14. Overview Problems Sample implementation: INFRA Empirical example Conclusions Design Central storage Data management phase handled on the server; analyses can be handled either on the server (SSH) or locally (INFRA) External data MongoDB server Computer2 Computer3Computer1 Computer4 Server: Linux-VM with MongoDB server; Clients: Python, INFRA, mongo client Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 15. Overview Problems Sample implementation: INFRA Empirical example Conclusions Design Enjoying the advantages of BOW — and overcoming its shortcomings Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 16. Overview Problems Sample implementation: INFRA Empirical example Conclusions Design Enjoying the advantages of BOW — and overcoming its shortcomings In the preprocessing phase • all information is still • we can use custom regexp-based rules and filters e.g.: if a text contains [list of synomys of A] and [list of synomys of B], replace [synomys of A] with C • extremely useful for unifying actors that are referred in several ways Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 17. Overview Problems Sample implementation: INFRA Empirical example Conclusions Design Enjoying the advantages of BOW — and overcoming its shortcomings In the preprocessing phase • all information is still • we can use custom regexp-based rules and filters e.g.: if a text contains [list of synomys of A] and [list of synomys of B], replace [synomys of A] with C • extremely useful for unifying actors that are referred in several ways In the analysis phase • work with a much faster dataset that contains only the necessary information • no need to deal with misspellings and variations any more Packing and Unpacking the Bag of Words Trilling & Jonkman
  • 18. Design: Towards a “best practice” of inductive framing analysis. In the data management phase: • spend much time on recoding relevant multi-word entities to avoid noise (of course, “Barack” and “Obama” occur together) and on recoding synonyms (how would you otherwise reliably estimate frequencies?) ⇒ especially important for questions like “how is actor X framed?” • use regular expressions instead of simple word lists • make an informed decision on how to harmonize the dataset (stopword removal, stemming (?), POS tagging (?)) • and: share these procedures!
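To illustrate why recoding multi-word entities matters for frequency estimates, here is a minimal sketch that applies an ordered list of regular-expression rules and then counts tokens. The two rules and the underscore-joined tokens are assumptions made for this example only.

```python
import re
from collections import Counter

# Hypothetical recoding rules: (compiled pattern, single-token replacement).
RULES = [
    (re.compile(r"\b(?:President\s+)?(?:Barack\s+)?Obama\b"), "barack_obama"),
    (re.compile(r"\bWhite\s+House\b"), "white_house"),
]


def recode(text):
    """Apply every rule in order so each entity becomes one token."""
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return text


sample = "President Barack Obama spoke at the White House. Obama left early."
recoded = recode(sample)

# After recoding, both references to the same actor are counted together,
# and "barack" and "obama" no longer appear as two co-occurring words.
counts = Counter(re.findall(r"\w+", recoded.lower()))
```

A plain word list would need a separate entry for every variant; a single pattern with optional parts covers "Obama", "Barack Obama", and "President Barack Obama".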
  • 19. Design: Towards a “best practice” of inductive framing analysis. In the analysis phase: • background knowledge is necessary (face validity) • robustness: do slightly different parameters deliver similar results? • a dataset that is too small ⇒ sensitivity to atypical events (scandals etc.) ⇒ discovering a topic rather than a frame • mind the difference between statistical predictive power and meaningfulness
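One possible way to make the robustness question concrete is to compare the top words of two model runs obtained with slightly different parameters. The sketch below uses Jaccard overlap for that comparison; this measure, the function names, and the example word lists are assumptions and are not prescribed by the slides.

```python
def jaccard(words_a, words_b):
    """Share of words that two word lists have in common (0.0 to 1.0)."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 1.0  # two empty lists are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def best_match_overlap(run_a, run_b):
    """For each topic in run_a, the highest overlap with any topic in run_b."""
    return [max(jaccard(topic, other) for other in run_b) for topic in run_a]


# Hypothetical top words per topic from two runs with different settings.
run_a = [["bank", "loan", "debt"], ["oil", "gas", "price"]]
run_b = [["oil", "gas", "energy"], ["bank", "loan", "credit"]]

overlaps = best_match_overlap(run_a, run_b)
```

Values close to 1 suggest that a topic reappears under the changed settings; low values flag topics that depend strongly on the parameter choice and deserve a closer look.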
  • 20. Empirical example: Dutch business news
  • 21. Steps: Preprocessing steps. 1 Ingest and parse all possibly relevant articles (≈ 500 000) 2 Compose a list of ≈ 1 500 regular expressions to substitute synonyms and combinations to correctly code actors, allowing for conditional substitutions 3 Remove stopwords, punctuation, etc. 4 Determine part of speech; keep only nouns, adjectives, and adverbs
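Step 3 of the preprocessing can be sketched with the standard library alone. The stopword list below is a tiny hypothetical sample, not the list used in the study, and step 4 (part-of-speech filtering) is left out because it requires an external tagger.

```python
import re

# Hypothetical, deliberately tiny Dutch stopword list for illustration.
STOPWORDS = {"de", "het", "een", "en", "van", "in"}


def preprocess(text):
    """Lowercase, keep only runs of letters, and drop stopwords."""
    # [^\W\d_]+ matches letters only, so punctuation and digits are removed.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [token for token in tokens if token not in STOPWORDS]


# "The profit of the company rose in 2012."
tokens = preprocess("De winst van het bedrijf steeg in 2012.")
```

Dropping digits is a design choice made for this sketch; a real pipeline might keep years or amounts if they matter for the research question.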
  • 22. Steps: Analysis steps. 1 Determine relevant actors with frequency counts, filtering out all non-Dutch words (alternative: named entity recognition) 2 Conduct PCA, cluster analysis, and LDA; additionally, count the frequency of actor mentions 3 Fine-tune, repeat, and choose the final model
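Counting actor mentions, as in analysis step 2, could look roughly like the following sketch. It assumes the documents have already been tokenized and the actors recoded to single tokens; the actor names and documents are made up for the example.

```python
from collections import Counter


def actor_mentions(documents, actors):
    """Count how often each actor token occurs across tokenized documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(token for token in tokens if token in actors)
    return counts


# Hypothetical tokenized documents and actor set.
documents = [
    ["shell", "winst", "shell"],
    ["philips", "shell"],
    ["winst"],
]
actors = {"shell", "philips", "ing"}

mentions = actor_mentions(documents, actors)
top_actor = mentions.most_common(1)
```

An actor that is never mentioned simply has a count of zero, because `Counter` returns 0 for missing keys.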
  • 23. Output: Example: Attention over time. Overview of news attention: attention to 100 firms in company news and entropy (red line) from 2007 to 2013.
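The slide does not state how the entropy line was computed. A minimal sketch, assuming Shannon entropy over each firm's share of all mentions in a given period; the function name and the counts are hypothetical.

```python
import math


def attention_entropy(mentions):
    """Shannon entropy (in bits) of the distribution of mentions over firms."""
    total = sum(mentions.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in mentions.values():
        if count > 0:  # firms without mentions do not contribute
            share = count / total
            entropy -= share * math.log2(share)
    return entropy


# Hypothetical mention counts for one period.
even = attention_entropy({"a": 5, "b": 5, "c": 5, "d": 5})
concentrated = attention_entropy({"a": 10, "b": 0, "c": 0, "d": 0})
```

Under this reading, high entropy means attention is spread evenly over many firms, and low entropy means that a few firms dominate the news.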
  • 24. Output: Example: Topics. Results of a topic model
  • 25. Output: Example: Components. Results of a principal component analysis
  • 26. Output: Example: Co-occurrences. Results of a network visualization of co-occurrences
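The edge weights behind such a network can be derived by counting, for every pair of actors, the number of documents in which both appear. The sketch below is one plausible way to do this with the standard library; it is not taken from INFRA, and the documents and actor names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(documents, actors):
    """Count in how many documents each pair of actors appears together."""
    pairs = Counter()
    for tokens in documents:
        # Each actor counts once per document; sorting makes pairs canonical.
        present = sorted(set(tokens) & actors)
        pairs.update(combinations(present, 2))
    return pairs


documents = [
    ["shell", "bp", "shell"],
    ["shell", "bp", "ing"],
    ["ing"],
]
edges = cooccurrences(documents, {"shell", "bp", "ing"})
```

Each key of `edges` is an alphabetically ordered pair, so the result can be passed to a graph library as a weighted edge list without duplicate edges.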
  • 27. Conclusions • We developed a toolkit that integrates all recent methods used for automated inductive framing analysis • It is free • It works with large-scale datasets • It can be used by a whole group together
  • 28. Next steps • Regarding the tool: graphical interface • Regarding the method: systematic validation study; comparing different approaches and settings
  • 29. Questions? d.c.trilling@uva.nl @damian0604 www.damiantrilling.net