Red Blue Presentation
1. Red / Blue
Using Machine Learning to Build an
Ideologically Balanced News Diet
Salil Doshi
Sam Goodgame
Susan Eun Park
Paul Platzman
May 21st, 2016
2. May 15th, 2016 -- Six Days Ago...
“...Today in every phone in one of your pockets we have access to more information than at any time in human history, at a touch of a button. But, ironically, the flood of information hasn’t made us more discerning of the truth. In some ways, it’s just made us more confident in our ignorance. We assume whatever is on the web must be true. We search for sites that just reinforce our own predispositions.”
-President Obama, Rutgers Commencement Address
7. Data Transformation
Removed common English words and candidate and moderator names
Vectorized the Data
Computed Term Frequency-Inverse Document Frequency (TF-IDF) Values
Sample TF-IDF Vectorized Matrix:
9. Feature Engineering
Truncated Singular Value Decomposition (TSVD)
Reduced number of features without compromising predictive performance
11,228 features --> 2,000 features
No reduction in F-1 Score or Accuracy Score
Models with fewer than 2,000 features experienced diminished performance
Trend observed across each model form
10. Parameter Tuning: Using Grid Search
● Optimized ‘C’ Value, the penalty parameter
● Maintained generalizability of model to prediction data
http://www.intechopen.com/source/html/45102/media/image44.png
16. Discussion
Results don’t match ideological spectrum of audiences.
Several potential interpretations:
Republican stories dominated news cycles
Republican candidates more regularly used pre-existing media language
Oral language is not strongly predictive of written language
17. Methodological Self-Evaluation (1)
● Strengths:
○ Expanded the instance set to reduce model performance variation
○ Removed moderator speech
○ Removed custom stop words
○ Employed a variety of model forms
○ Reduced feature set size without impeding performance
○ Optimized the ‘C’ parameter value
18. Methodological Self-Evaluation (2)
● Shortcomings:
○ RSS feed content was not always ideal or consistent
■ Contained ‘jQuery’ or advertisement placeholders
■ Variety in article length
■ Variable number of instances from each media outlet
○ Single source of training data
○ Uneven distribution of red/blue training data
19. Looking Towards Future Iterations
● Future studies could…
○ Use additional training data sources
○ Encompass prediction data of greater breadth and depth: more news sources and more articles per source
○ Include more feature engineering to account for differently formatted RSS feeds
○ Predict oral political dialogue
20. For Posterity
● Implications for partisanship...
○ The potential virtue of an ideologically balanced diet
○ A shift in media engagement behaviors could promote open-mindedness and compromise
○ This, in turn, could promote legislative functioning
Last weekend, President Obama delivered the commencement address at my alma mater, Rutgers University. In it, he alluded to the flood of information that we’ve become increasingly exposed to and the perhaps counterintuitive notion that it has not made us more informed. Instead, he noted, we use the web and social media as a tool to seek out information that reinforces our preexisting beliefs, tune out voices of those who don’t think like us, and amplify voices of those who do.
Indeed, America has become increasingly politically polarized during Obama’s tenure, and media consumption habits are thought to play a role in this emerging phenomenon.
In 2014, the Pew Research Center measured the ideological placement of the audiences of a variety of political news outlets. As you can see, political conservatives and liberals consume different news sources, and each outlet is believed to espouse and reinforce a particular philosophy within its readership.
If media consumption differentiation exists, it could presumably influence the political divisiveness that has manifested in, for example, gridlocked government -- the last two U.S. Congresses have been the two least productive historically. So political media content analysis is worthwhile.
Past studies have analyzed media content from a variety of angles, such as sentiment analysis, but we sought to evaluate media outlets based on their consistency with language spoken by politicians. Specifically, we asked: to what extent do media outlets’ written articles correspond to Democratic and Republican politicians’ word choices during the 2016 presidential primary debates? Does media language usage vary according to the same spectrum as the political preferences of their respective audiences? If so, could that suggest a link between language choice and political polarization?
Data Product: political language classifier for news articles.
High-level overview:
Build phase: generate model
Pull debate transcripts from Internet
Wrangle into text documents
Put them into the proper format for analysis
Fit classifier
Operational Phase:
Pull RSS feeds from the Internet
Put that data into the same form as our training data
Feed it to our model; receive one prediction per text document: “red” for Republican or “blue” for Democrat
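The two phases can be sketched as a minimal scikit-learn pipeline. The snippets and labels below are invented stand-ins for the actual debate transcripts and RSS articles; the key point is that the operational phase reuses the vectorizer fitted during the build phase:

```python
# Minimal sketch of the build + operational phases, using toy
# stand-in snippets rather than the actual debate corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Build phase: toy "transcript" documents with red/blue labels.
train_docs = [
    "cut taxes and shrink government regulation",
    "secure the border and strengthen the military",
    "expand healthcare coverage and raise the minimum wage",
    "address climate change and protect voting rights",
]
train_labels = ["red", "red", "blue", "blue"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
model = LinearSVC().fit(X_train, train_labels)

# Operational phase: transform new article text with the SAME
# fitted vectorizer, then receive one prediction per document.
articles = ["a plan to cut taxes and reduce regulation"]
predictions = model.predict(vectorizer.transform(articles))
```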
Now I’m going to drill down into the build phase, and then specifically into the initial data wrangling.
More depth for building our model:
Start with Debate HTML documents
Get them into text format, and into a data bunch
Conduct a type of analysis called TF-IDF, which is a weighted measure for word frequency in each document
Final data form: sparse matrix (I’ll go into more detail in a moment)
Feature engineering:
Remove stop words (“the” or “for”)
Remove non-predictive features
Data is in final form:
Evaluate three models: LR, SVM, and MNB
Iterative feature engineering and parameter tuning → fitted model
Drilling down into the initial data ingestion and wrangling:
Debate transcripts were HTML documents--ugly, with markup like ‘p’ and ‘body’ tags
Used BeautifulSoup to parse out the text, then spit out a document that only includes text
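A minimal sketch of that parsing step, using a stand-in HTML snippet rather than an actual transcript page:

```python
# Sketch of the HTML-to-text step with BeautifulSoup; the markup
# below is an invented stand-in for a debate-transcript page.
from bs4 import BeautifulSoup

html = """
<html><body>
  <p>COOPER: Welcome to the debate.</p>
  <p>We will fight for working families.</p>
  <script>var ad = 'placeholder';</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Drop script/style tags so only the spoken text survives.
for tag in soup(["script", "style"]):
    tag.decompose()
text = soup.get_text(separator="\n", strip=True)
```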
Data bunch format: particular directory structure compatible with scikit-learn modules
Vectorize data, transform it into weighted term frequency values, and remove stop words with one line of code
TF-IDF stands for Term Frequency-Inverse Document Frequency; we used scikit-learn to determine the weights for words
End result is a sparse matrix. Any given word appears relatively infrequently, so we have a lot of zeroes
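That one line of vectorization can be sketched as follows; the two toy documents stand in for the debate corpus:

```python
# Sketch: vectorize, weight by TF-IDF, and drop English stop
# words in a single step.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "we will cut taxes for the middle class",
    "we will expand access to affordable healthcare",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # scipy sparse matrix

# Most entries are zero: each document uses few of the corpus terms.
density = tfidf.nnz / (tfidf.shape[0] * tfidf.shape[1])
```

With stop words removed, only eight distinct terms survive here, and each document touches just four of them, which is why a sparse representation pays off.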
After getting data into proper format, we evaluated these three models:
The LR algorithm classifies data by obtaining a best-fit logistic function.
The MNB algorithm is a probabilistic classifier that applies Bayes’ theorem; it assumes (naively) that the features in the model are independent.
SVM classifies data by drawing a hyperplane that separates instances of different classes.
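A sketch of fitting all three model forms on a toy red/blue corpus. The snippets are invented stand-ins, and the scores here are training-set scores, not our reported results:

```python
# Sketch comparing the three model forms: Logistic Regression,
# Multinomial Naive Bayes, and a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs = [
    "lower taxes and less regulation",
    "strong military and secure borders",
    "affordable healthcare for every family",
    "action on climate change now",
]
labels = ["red", "red", "blue", "blue"]
X = TfidfVectorizer().fit_transform(docs)

models = {
    "LR": LogisticRegression(),
    "MNB": MultinomialNB(),
    "SVM": LinearSVC(),
}
# Fit each model form on the same TF-IDF matrix and record
# its training-set accuracy.
scores = {name: m.fit(X, labels).score(X, labels) for name, m in models.items()}
```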
Next, we wanted to make our model more efficient.
Moving forward with SVM as our best model, we used scikit-learn’s grid search to conduct parameter tuning. The model form we used was linear support vector classification, which has a penalty parameter called ‘C’ that controls how heavily the model penalizes misclassifications. We had to be careful not to overfit by choosing a C that simply minimized training errors, because the model would then be optimized solely for the training data. Instead, we chose a C that gave our linear model larger hyperplane margins and the best F-1 score across both Democratic and Republican data, putting it in the best position to predict against outside data.
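A minimal sketch of that grid search over ‘C’, again on an invented toy corpus:

```python
# Sketch of tuning LinearSVC's penalty parameter C with grid search.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

docs = [
    "cut taxes", "secure borders", "strong military", "less regulation",
    "expand healthcare", "climate action", "raise wages", "voting rights",
]
labels = ["red"] * 4 + ["blue"] * 4
X = TfidfVectorizer().fit_transform(docs)

# Smaller C -> wider margin (more regularization); larger C ->
# fewer training errors but a greater risk of overfitting.
grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]},
                    cv=2, scoring="f1_macro")
grid.fit(X, labels)
best_C = grid.best_params_["C"]
```

Scoring with a cross-validated F-1 average, rather than raw training accuracy, is what keeps the chosen C generalizable to outside data.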
Our optimized SVM model had an overall F-1 score of 0.83. The F1 score is a weighted average of the precision and recall, 1 being the best value and 0 being the worst. You can also see the broken out precision and recall for our Democratic and Republican data.
Accuracy rate is 84%
High precision and recall for Republican, but this could be attributed to the fact that we had twice as many Republican training instances as Democratic ones, owing to there having been more Republican debates than Democratic ones.
“A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels.”
Confusion matrix
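A sketch of computing the confusion matrix and F-1 score with scikit-learn; the label vectors below are hypothetical, not our actual predictions:

```python
# Sketch of the evaluation step: F-1 score and a confusion
# matrix built from hypothetical red/blue predictions.
from sklearn.metrics import confusion_matrix, f1_score

y_true = ["red", "red", "red", "blue", "blue", "red"]
y_pred = ["red", "red", "blue", "blue", "blue", "red"]

# Rows are true labels, columns are predicted labels,
# in the order given: blue, then red.
cm = confusion_matrix(y_true, y_pred, labels=["blue", "red"])

# F-1 is the harmonic mean of precision and recall for a class.
f1 = f1_score(y_true, y_pred, pos_label="red")
```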
Goal: get RSS data into the same exact vector format as our training data, then feed it into our model
OPML (Outline processor markup language) documents: instructions for pulling specific RSS feeds →
Baleen: An automated ingestion service for blogs to construct a corpus for NLP research. →
Instantiate separate MongoDB database per news source. Documents are instances, and words are features.
html →
Transform into text →
Feed to model
Here is a graphic imported from Tableau that shows a normalized spectrum of our prediction results.
12 news sources; articles per source ranged from 27 to 173
As the slide states, 79% of the news articles that we analyzed had language that was more consistent with Republican rhetoric than Democratic rhetoric
While most sources fall within one standard deviation of the mean, the Washington Post and The Nation are outliers
Here’s another way of representing the information on the previous slide: a spectrum of absolute values, more uniformly Democratic on the left, more uniformly Republican on the right.
As Salil said, among all articles, 79% were classified as more consistent with Rep than Dem. Note that the majority of news sources are clustered together between 76% and 94% Red. You might see that MSNBC, often conventionally assumed to be left wing, is the furthest right. This probably didn’t conform with your expectations.
It also didn’t conform with the Pew Research findings. Here is a comparison of our results and the Pew findings. Although WAPO is about as far left as both scales go, most other sources display no meaningful relationship. So what does this mean?
Why DIDN’T our results match the ideological spectrum of audiences?
Parsing each debate into one document yielded a low sample size for our model, so we re-parsed our debate transcripts to yield one document per paragraph.
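The re-parsing step amounts to splitting each transcript into one document per paragraph; a sketch with a stand-in transcript:

```python
# Sketch of the re-parsing step: one document per paragraph
# instead of one per debate, enlarging the instance set.
# The transcript text below is an invented stand-in.
transcript = (
    "COOPER: Welcome to the debate.\n\n"
    "We need to cut taxes for families.\n\n"
    "Healthcare should be a right for all."
)

# Split on blank lines, dropping empty fragments.
paragraphs = [p.strip() for p in transcript.split("\n\n") if p.strip()]
```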
Next, we removed instances that contained moderators’ remarks - instance engineering
Created a list of custom stop words, added to scikit-learn’s original set, to further strengthen our training data: removal of candidates’ and moderators’ names
LR, MNB, and SVM - chosen because they were appropriate for the binary classification nature of our analysis
We reduced dimensionality by fitting our TF-IDF matrix to a Truncated Singular Value Decomposition model, scaling down to 2,000 features, the smallest size before which we observed gradual reductions in model performance.
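A sketch of that TSVD step; our real run reduced 11,228 features to 2,000, while this toy corpus is reduced to 2 components just to show the mechanics:

```python
# Sketch of Truncated SVD dimensionality reduction on a
# TF-IDF matrix built from invented stand-in documents.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "cut taxes for working families",
    "secure the border now",
    "healthcare is a right",
    "act on climate change",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Unlike PCA, TruncatedSVD works directly on sparse matrices.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)  # dense (n_docs, 2) array
```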
The ‘C’ value represents the misclassification penalty parameter, which we tuned so our model wasn’t overly optimized on its ability to fit the training data.
After transforming the HTML we pulled from RSS feeds, we discovered documents with jQuery script tags in addition to journalistic content.
Other transformed documents contained solely advertisements or placeholder HTML tags for advertisements.
Further, different news sources produced different kinds of RSS feeds: some were long-form with in-depth analysis, while others contained only blurbs that set up a resulting slideshow (not used in our data).
Other shortcomings include that we used only debate transcripts as our source of training data and that we had far more Republican debates (and candidates) than Democratic ones.
First, future studies could include more news sources and many more articles per news source.
Second, they could even out the distribution of Republican and Democratic speech in the training set.
Third, they could improve feature engineering, specifically regarding transforming data from its organic form into text documents and vectors.
Instead of relying solely on debate transcripts for the training data corpus, a future study could use debate transcripts to fit an initial model, then use that model to make predictions about a cross-section of article data, then feed the labeled article data back into the fitted model to strengthen and generalize it.
So going back to the conundrum that Paul outlined at the beginning of the presentation, it’s important to consider what implications a text classifier built to identify partisan-leaning language can have for individual news consumption. If people choose the news they read to reinforce their pre-existing beliefs, then it’s worth examining the potential virtue of an ideologically balanced diet. By being conscientious in our news consumption, we could see a shift toward media engagement behaviors that are more open-minded and less entrenched in ideology. Becoming open to compromise and working with the other side could, in turn, promote legislative functioning.