Red / Blue
Using Machine Learning to Build an Ideologically Balanced News Diet
Salil Doshi
Sam Goodgame
Susan Eun Park
Paul Platzman
May 21st, 2016
May 15th, 2016 -- Six Days Ago...
“...Today in every phone in one of your pockets we have access to more information than at any time in human history, at a touch of a button. But, ironically, the flood of information hasn’t made us more discerning of the truth. In some ways, it’s just made us more confident in our ignorance. We assume whatever is on the web must be true. We search for sites that just reinforce our own predispositions.”
-President Obama, Rutgers Commencement Address
Pew Research Center
April 29, 2014
Architecture
Build Phase
Training Data Ingestion and Wrangling
Data Transformation
Removed common English stop words, as well as candidate and moderator names
Vectorized the Data
Computed Term Frequency-Inverse Document Frequency (TF-IDF) Values
Sample TF-IDF Vectorized Matrix:
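The transformation steps above (stop-word removal, vectorization, TF-IDF weighting) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the stop-word list, the placeholder speaker names ("smith", "jones"), the sample sentences, and the helper names `tokenize` and `tfidf_matrix` are all hypothetical, and the weighting used here is the textbook form tf × ln(N / df), which may differ from the variant the project actually used.

```python
import math
from collections import Counter

# Hypothetical stop-word list: a few common English words plus placeholder
# speaker names standing in for the candidate and moderator names.
STOP_WORDS = {"the", "a", "and", "of", "to", "we", "will", "our", "smith", "jones"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,!?;:\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def tfidf_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF of term j in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against documents that are all stop words
        rows.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term])
            for term in vocabulary
        ])
    return vocabulary, rows

# Made-up sample documents, for illustration only.
docs = [
    "We will cut taxes and secure the border.",
    "We will expand healthcare and raise the minimum wage.",
    "Taxes fund our healthcare.",
]
vocab, matrix = tfidf_matrix(docs)
for row in matrix:
    print([round(value, 3) for value in row])
```

Each row is one document and each column one vocabulary term; a term that appears in every document would receive a weight of zero under this variant.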
Model Estimators
Binary Classification Models:
Logistic Regression (LR)
Multinomial Naive Bayes (MNB)
Support Vector Machine (SVM)
Feature Engineering
Truncated Singular Value Decomposition (TSVD)
Reduced the number of features without compromising predictive performance
11,228 features --> 2,000 features
No reduction in F-1 Score or Accuracy Score
Models with fewer than 2,000 features experienced diminished performance
Trend observed across each model form
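A rough sketch of what truncated SVD does to a document-term matrix, assuming NumPy is available. The matrix here is random toy data and the function name `truncated_svd` is hypothetical; the project's own matrix (11,228 features reduced to 2,000) and its exact tooling are not shown in the slides.

```python
import numpy as np

def truncated_svd(matrix, n_components):
    """Project the rows of `matrix` onto its top `n_components` right singular vectors."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    reduced = u[:, :n_components] * s[:n_components]  # shape: (n_docs, n_components)
    components = vt[:n_components]                     # shape: (n_components, n_features)
    return reduced, components

# Toy stand-in for a TF-IDF matrix: 6 documents, 10 features.
rng = np.random.default_rng(0)
toy = rng.random((6, 10))

reduced, components = truncated_svd(toy, n_components=3)
print(reduced.shape)     # each document is now described by 3 features
print(components.shape)  # each new feature is a weighted mix of the 10 originals
```

The reduced matrix is what a classifier would be trained on; keeping fewer components discards more of the original variance, which is consistent with the performance drop the slide reports below 2,000 features.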
Parameter Tuning: Using Grid Search
● Optimized ‘C’ Value, the penalty parameter
● Maintained generalizability of model to prediction data
http://www.intechopen.com/source/html/45102/media/image44.png
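One way a grid search over 'C' might look, assuming scikit-learn. This is a sketch under stated assumptions rather than the project's pipeline: the choice of `LinearSVC`, the candidate values for C, the two-fold cross-validation, the macro F-1 scoring, and the toy sentences and labels are all illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up training sentences with made-up party labels (R / D).
texts = [
    "cut taxes and shrink government spending",
    "secure the border and enforce immigration law",
    "repeal regulations that hurt small business",
    "protect gun rights and strengthen the military",
    "expand healthcare coverage for working families",
    "raise the minimum wage and support unions",
    "invest in clean energy and fight climate change",
    "make college affordable and reduce student debt",
]
labels = ["R", "R", "R", "R", "D", "D", "D", "D"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", LinearSVC()),
])

# Smaller C means stronger regularization; larger C fits the training data more closely.
param_grid = {"svm__C": [0.01, 0.1, 1.0, 10.0]}

# Cross-validation scores each C on held-out folds, which is what keeps the
# chosen value from simply memorizing the training set.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

best_c = search.best_params_["svm__C"]
print("best C:", best_c)
```

Because the score for each candidate comes from data the model did not train on, the selected C favors generalization to the prediction data rather than training-set fit.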
SVM Model Performance Metrics
                 Precision   Recall   F-1 Score
Democratic       0.76        0.58     0.66
Republican       0.86        0.93     0.89
Average/Total    0.83        0.84     0.83

Correct Democratic: n=392      Incorrect Democratic: n=279
Correct Republican: n=1693     Incorrect Republican: n=121
Overall Accuracy Rate: 84%
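The reported metrics can be recomputed from the four counts on this slide. The sketch below assumes that "Incorrect Democratic" means Democratic instances the model labeled Republican (and likewise for "Incorrect Republican"); that reading is the one that reproduces the table. The function and variable names are illustrative.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Standard per-class precision, recall, and F-1 from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the slide.
dem_correct, dem_missed = 392, 279    # Democratic instances: labeled right / wrong
rep_correct, rep_missed = 1693, 121   # Republican instances: labeled right / wrong

# A Republican instance labeled wrong is a false positive for the Democratic
# class, and vice versa.
dem = precision_recall_f1(dem_correct, false_pos=rep_missed, false_neg=dem_missed)
rep = precision_recall_f1(rep_correct, false_pos=dem_missed, false_neg=rep_missed)

total = dem_correct + dem_missed + rep_correct + rep_missed
accuracy = (dem_correct + rep_correct) / total

print("Democratic:", [round(x, 2) for x in dem])  # [0.76, 0.58, 0.66]
print("Republican:", [round(x, 2) for x in rep])  # [0.86, 0.93, 0.89]
print("Accuracy:", round(accuracy, 2))            # 0.84
```

The low Democratic recall (0.58) alongside high Republican recall (0.93) shows the model leaning toward the Republican label, the majority class in the evaluation counts.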
Operational Phase
Prediction Results: Normalized Spectrum
● 79% of all documents were classified as Republican
Prediction Results: Media Source Spectrum
Prediction Results vs. Pew Research Center Results
Discussion
Results don’t match the ideological spectrum of the outlets’ audiences.
Several potential interpretations:
Republican stories dominated news cycles
Republican candidates more regularly used pre-existing media language
Oral language is not strongly predictive of written language
Methodological Self-Evaluation (1)
● Strengths:
○ Expansion of instance set to reduce model performance variation
○ Removal of moderator speech
○ Removal of custom stop words
○ Employed a variety of model forms
○ Reduced feature set size without impeding performance
○ Optimized ‘C’ parameter value
Methodological Self-Evaluation (2)
● Shortcomings:
○ RSS feed content was not always ideal or consistent
■ Contained ‘jQuery’ or advertisement placeholders
■ Variety in article length
■ Variable number of instances from each media outlet
○ Single source of training data
○ Uneven distribution of red/blue training data
Looking Towards Future Iterations
● Future studies could…
○ Use additional training data sources
○ Encompass prediction data of greater breadth and depth: more news sources and more articles per source
○ Include more feature engineering to account for differently formatted RSS feeds
○ Predict oral political dialogue
For Posterity
● Implications for partisanship...
○ The potential virtue of an ideologically balanced diet
○ A shift in media engagement behaviors could promote open-mindedness and compromise
○ This, in turn, could promote legislative functioning
Questions?

Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 

Red Blue Presentation

  • 1. Red / Blue Using Machine Learning to Build an Ideologically Balanced News Diet Salil Doshi Sam Goodgame Susan Eun Park Paul Platzman May 21st, 2016
  • 2. May 15th, 2016 -- Six Days Ago... “...Today in every phone in one of your pockets we have access to more information than at any time in human history, at a touch of a button. But, ironically, the flood of information hasn’t made us more discerning of the truth. In some ways, it’s just made us more confident in our ignorance. We assume whatever is on the web must be true. We search for sites that just reinforce our own predispositions.” -President Obama, Rutgers Commencement Address
  • 6. Training Data Ingestion and Wrangling
  • 7. Data Transformation: Removed common English words and candidate and moderator names. Vectorized the data. Computed Term Frequency-Inverse Document Frequency (TF-IDF) values. Sample TF-IDF Vectorized Matrix:
  • 8. Model Estimators: Binary Classification Models: Logistic Regression (LR), Multinomial Naive Bayes (MNB), Support Vector Machine (SVM)
  • 9. Feature Engineering: Truncated Singular Value Decomposition (TSVD). Reduced number of features without compromising predictive performance: 11,228 features --> 2,000 features. No reduction in F-1 Score or Accuracy Score. Models with fewer than 2,000 features experienced diminished performance. Trend observed across each model form.
  • 10. Parameter Tuning: Using Grid Search ● Optimized ‘C’ Value, the penalty parameter ● Maintained generalizability of model to prediction data http://www.intechopen.com/source/html/45102/media/image44.png
  • 11. SVM Model Performance Metrics
        Class          Precision  Recall  F-1 Score
        Democratic     0.76       0.58    0.66
        Republican     0.86       0.93    0.89
        Average/Total  0.83       0.84    0.83
        Correct Democratic: n=392; Incorrect Democratic: n=279
        Correct Republican: n=1693; Incorrect Republican: n=121
        Overall Accuracy Rate: 84%
  • 13. Prediction Results: Normalized Spectrum ● 79% of all documents were classified as Republican
  • 14. Prediction Results: Media Source Spectrum
  • 15. Prediction Results vs. Pew Research Center Results
  • 16. Discussion
        ● Results don’t match ideological spectrum of audiences.
        ● Several potential interpretations:
          ○ Republican stories dominated news cycles
          ○ Republican candidates more regularly used pre-existing media language
          ○ Oral language is not strongly predictive of written language
  • 17. Methodological Self-Evaluation (1)
        ● Strengths:
          ○ Expansion of instance set to reduce model performance variation
          ○ Removal of moderator speech
          ○ Removal of custom stop words
          ○ Employed a variety of model forms
          ○ Reduced feature set size without impeding performance
          ○ Optimized ‘C’ parameter value
  • 18. Methodological Self-Evaluation (2)
        ● Shortcomings:
          ○ RSS feed content was not always ideal or consistent
            ■ Contained ‘jQuery’ or advertisement placeholders
            ■ Variety in article length
            ■ Variable number of instances from each media outlet
          ○ Single source of training data
          ○ Uneven distribution of red/blue training data
  • 19. Looking Towards Future Iterations
        ● Future studies could…
          ○ Use additional training data sources
          ○ Encompass prediction data of greater breadth and depth: more news sources and more articles per source
          ○ Include more feature engineering to account for differently formatted RSS feeds
          ○ Predict oral political dialogue
  • 20. For Posterity
        ● Implications for partisanship...
          ○ The potential virtue of an ideologically balanced diet
          ○ A shift in media engagement behaviors could promote open-mindedness and compromise
          ○ This, in turn, could promote legislative functioning

Editor's Notes

  1. Last weekend, President Obama delivered the commencement address at my alma mater, Rutgers University. In it, he alluded to the flood of information that we’ve become increasingly exposed to and the perhaps counterintuitive notion that it has not made us more informed. Instead, he noted, we use the web and social media as a tool to seek out information that reinforces our preexisting beliefs, tune out voices of those who don’t think like us, and amplify voices of those who do. Indeed, America has become increasingly politically polarized during Obama’s tenure and media consumption habits are thought to play a role in this emerging phenomenon.
  2. In 2014, the Pew Research Center measured the ideological placement of the audiences of a variety of political news outlets. As you can see, political conservatives and liberals consume different news sources, and each source is believed to espouse and reinforce particular philosophies within its readership. If media consumption differentiation exists, it could presumably influence the political divisiveness that has manifested in, for example, gridlocked government -- the last two U.S. Congresses have been the two least productive historically. So political media content analysis is worthwhile. Past studies have analyzed media content from a variety of angles, such as sentiment analysis, but we sought to evaluate media outlets based on their consistency with language spoken by politicians. Specifically, we asked: to what extent do media outlets’ written articles correspond to Democratic and Republican politicians’ word choices during the 2016 presidential primary debates? Does media language usage vary according to the same spectrum as the political preferences of their respective audiences? If so, could that suggest a link between language choice and political polarization?
  3. Data product: a political language classifier for news articles. High-level overview: Build phase (generate the model): pull debate transcripts from the Internet; wrangle them into text documents; put them into the proper format for analysis; fit the classifier. Operational phase: pull RSS feeds from the Internet; put that data into the same form as our training data; feed it to our model; receive one prediction per text document: “red” for Republican or “blue” for Democrat. Now I’m going to drill down into the build phase, and then specifically into the initial data wrangling.
  4. More depth on building our model: Start with the debate HTML documents. Get them into text format, and into a data bunch. Conduct a type of analysis called TF-IDF, which is a weighted measure of word frequency in each document. Final data form: a sparse matrix (I’ll go into more detail in a moment). Feature engineering: remove stop words (“the” or “for”); remove non-predictive features. Once the data is in its final form: evaluate three models (LR, SVM, and MNB), then iterate on feature engineering and parameter tuning → fitted model.
  5. Drilling down into the initial data ingestion and wrangling: Debate transcripts were HTML documents--ugly, with markup like ‘p’ and ‘body’ tags. We used BeautifulSoup to parse out the text, then wrote out a document that includes only text. Data bunch format: a particular directory structure compatible with scikit-learn modules.
  6. Vectorize the data, transform it into weighted term frequency values, and remove stop words with one line of code. TF-IDF stands for Term Frequency-Inverse Document Frequency; a scikit-learn module determines the weights for words. The end result is a sparse matrix: any given word appears in relatively few documents, so we have a lot of zeroes.
  7. After getting data into proper format, we evaluated these three models: The LR algorithm classifies data by obtaining a best-fit logistic function. The MNB algorithm is a probabilistic classifier that applies Bayes’ theorem; it assumes (naively) that the features in the model are independent. SVM separates categories in data by drawing a separating hyperplane between instances of different classes.
  8. Next, we wanted to make our model more efficient.
  9. Moving forward with SVM as our best model, we used scikit-learn’s grid search to conduct parameter tuning. The type of model we used was linear support vector classification, which has a penalty parameter called ‘C’. This parameter controls how heavily the model is penalized for misclassifying training instances. We had to be careful not to overfit our model by minimizing training errors through C alone, because then it would be optimized solely on the training data. We chose a C that gave us wider margins around the separating hyperplane in our linear model and the best F-1 Score across both Democratic and Republican data, which meant that it was in the best position to predict against the outside data that we introduce.
  10. Our optimized SVM model had an overall F-1 score of 0.83. The F-1 score is the harmonic mean of precision and recall, with 1 being the best value and 0 the worst. You can also see the broken-out precision and recall for our Democratic and Republican data. The accuracy rate is 84%. Precision and recall are high for Republican, but this could be attributed to the fact that we had twice as much Republican training data as Democratic training data, owing to the fact that there were more Republican debates than Democratic ones. “A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels.” The slide also shows the confusion matrix.
  11. Goal: get RSS data into exactly the same vector format as our training data, then feed it into our model. OPML (Outline Processor Markup Language) documents: instructions for pulling specific RSS feeds → Baleen: an automated ingestion service for blogs to construct a corpus for NLP research → Instantiate a separate MongoDB database per news source. Documents are instances, and words are features. HTML → Transform into text → Feed to model.
  12. Here is a graphic imported from Tableau that shows a normalized spectrum of our prediction results. There were 12 news sources, and the number of articles per source ranged from 27 to 173. As the slide states, 79% of the news articles that we analyzed had language that was more consistent with Republican rhetoric than Democratic rhetoric. While most sources fall within one standard deviation of the mean, The Washington Post and The Nation are outliers.
  13. Explain what we’re looking at. Another way of representing the information on the previous slide. Spectrum of absolute values. More uniformly Democratic on the left, more uniformly Republican on the right. As Salil said, among all articles, 79% were classified as more consistent with Republican than Democratic language. Note that the majority of news sources are clustered together between 76% and 94% Red. You might see that MSNBC, often conventionally assumed to be left wing, is the furthest right. This probably didn’t conform with your expectations.
  14. It also didn’t conform with the Pew Research findings. Here is a comparison of our results and the Pew findings. Although The Washington Post sits about as far left as either scale goes, most other sources show no meaningful relationship between the two rankings. So what does this mean?
  15. Why DIDN’T our results match the ideological spectrum of audiences?
  16. Parsing each debate into one document yielded a low sample size for our model, so we re-parsed our debate transcripts to yield one document per paragraph. Next, we removed instances that contained moderators’ remarks (instance engineering). We created a list of custom stop words, added onto scikit-learn’s original set, to further strengthen our training data by removing candidates’ names and moderators’ names. LR, MNB, and SVM were chosen because they were appropriate for the binary classification nature of our analysis. We reduced the feature set by fitting our TF-IDF vectors to a Truncated Singular Value Decomposition model, scaling down to 2,000 features, the last point before which we observed gradual reductions in model performance. The ‘C’ value represents the misclassification penalty parameter, which we optimized so that our model wasn’t overly optimized on its ability to correctly fit the training data.
  17. After transforming the HTML we pulled from RSS feeds, we discovered documents with jQuery script tags in addition to journalistic content. Other transformed documents contained solely advertisements or placeholder HTML tags for advertisements. Further, different news sources produced different kinds of RSS feeds. Some were long-form with in-depth analysis; others simply contained blurbs that set up a resulting slideshow (not used in our data). Other shortcomings include that we used only debate transcripts as our source for training data and that we had far more Republican debates (and candidates) than Democratic ones.
  18. First, future studies could include more news sources and many more articles per news source. Second, they could even out the distribution of Republican and Democratic speech in the training set. Third, they could improve feature engineering, specifically regarding transforming data from its organic form into text documents and vectors. Finally, instead of relying solely on debate transcripts for the training data corpus, a future study could use debate transcripts to fit an initial model, then use that model to make predictions about a cross-section of article data, then feed the labeled article data back into the fitted model to strengthen and generalize it.
  19. So going back to the conundrum that Paul outlined at the beginning of the presentation, it’s important to consider what implications a text classifier built to identify partisan-leaning language can have for individual news consumption. If people are choosing the news they read to reinforce the pre-existing beliefs they hold, then it’s worth examining the potential virtue of an ideologically balanced diet. By being conscientious with our news consumption, we could witness a shift toward media engagement behaviors that are more open-minded and less entrenched in ideology. Becoming open to compromise and working with the other side could promote legislative functioning.
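
Notes 5 and 17 describe stripping markup from HTML documents, including stray script tags. The presenters used BeautifulSoup; the sketch below swaps in Python's standard-library `html.parser` so that it runs without extra dependencies. The class `TextExtractor`, the helper `html_to_text`, and the sample markup are all hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script>/<style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Invented sample markup, not taken from any real transcript or feed.
sample = (
    "<html><body><p>SPEAKER A: Opening statement.</p>"
    "<script>var x = 1;</script>"
    "<p>SPEAKER B: Response.</p></body></html>"
)
print(html_to_text(sample))
```

Dropping script contents at this stage is one way to keep jQuery snippets out of the documents that are later vectorized.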
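
Notes 6 through 9 and note 16 describe the build phase: TF-IDF vectorization with custom stop words, feature reduction with Truncated SVD, and a grid search over the ‘C’ parameter of a linear support vector classifier. The following is a minimal sketch of how those steps might be chained with scikit-learn, not the presenters' actual code. The toy documents, the labels, the custom stop words, the component count, and the grid of ‘C’ values are all invented for illustration (the presentation reports 2,000 components on the real corpus).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy corpus standing in for per-paragraph debate documents.
docs = [
    "raise the minimum wage and expand health care coverage",
    "invest in public education and protect voting rights",
    "climate change requires clean energy investment",
    "paid family leave helps working families",
    "cut taxes and reduce government regulation",
    "secure the border and strengthen the military",
    "repeal the mandate and balance the federal budget",
    "protect gun rights and limit federal spending",
]
labels = ["blue", "blue", "blue", "blue", "red", "red", "red", "red"]

# Hypothetical custom stop words added to scikit-learn's built-in English set.
stop_words = list(ENGLISH_STOP_WORDS.union({"moderator", "senator"}))

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words=stop_words)),      # sparse TF-IDF matrix
    ("tsvd", TruncatedSVD(n_components=3, random_state=0)),  # feature reduction
    ("svc", LinearSVC(random_state=0)),                      # linear SVM classifier
])

# Candidate penalty values for the grid search (assumed, not from the slides).
param_grid = {"svc__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=2)
search.fit(docs, labels)

print(search.best_params_)
prediction = search.predict(["lower taxes and less regulation"])
print(prediction)
```

Wrapping the vectorizer and the decomposition in the same pipeline means each cross-validation fold refits them on its own training split, so the score used to pick ‘C’ is not inflated by vocabulary seen in the held-out split.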
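
The metrics in note 10 can be recomputed from the four counts on the SVM performance slide. The sketch below assumes that “Incorrect Democratic” (n=279) counts Democratic instances that were predicted Republican and that “Incorrect Republican” (n=121) counts Republican instances that were predicted Democratic; the slide does not state this, but the reading reproduces the reported table. The function name `precision_recall_f1` is hypothetical.

```python
# Counts from the SVM performance slide.
dem_correct, dem_missed = 392, 279    # true Democratic: predicted Dem / predicted Rep
rep_correct, rep_missed = 1693, 121   # true Republican: predicted Rep / predicted Dem


def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F-1 (harmonic mean of the two)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# A Republican instance labeled Democratic is a false positive for the
# Democratic class, and vice versa.
dem = precision_recall_f1(tp=dem_correct, fp=rep_missed, fn=dem_missed)
rep = precision_recall_f1(tp=rep_correct, fp=dem_missed, fn=rep_missed)

dem_support = dem_correct + dem_missed
rep_support = rep_correct + rep_missed
total = dem_support + rep_support

accuracy = (dem_correct + rep_correct) / total
# Support-weighted average of the per-class F-1 scores.
weighted_f1 = (dem_support * dem[2] + rep_support * rep[2]) / total

print([round(v, 2) for v in dem])   # [0.76, 0.58, 0.66]
print([round(v, 2) for v in rep])   # [0.86, 0.93, 0.89]
print(round(accuracy, 2))           # 0.84
print(round(weighted_f1, 2))        # 0.83
```

Under this reading, the “Average/Total” row is a support-weighted average, which is why the overall figures sit much closer to the Republican row: Republican instances make up 1,814 of the 2,485 evaluated documents.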