Red Blue Presentation
1. Red / Blue
Using Machine Learning to Build an
Ideologically Balanced News Diet
Salil Doshi
Sam Goodgame
Susan Eun Park
Paul Platzman
May 21st, 2016
2. May 15th, 2016 -- Six Days Ago...
“...Today in every phone in one of your pockets we have access to more information than at any time in human history, at a touch of a button. But, ironically, the flood of information hasn’t made us more discerning of the truth. In some ways, it’s just made us more confident in our ignorance. We assume whatever is on the web must be true. We search for sites that just reinforce our own predispositions.”
-President Obama, Rutgers Commencement Address
7. Data Transformation
Removed common English words and candidate and moderator names
Vectorized the Data
Computed Term Frequency-Inverse Document Frequency (TF-IDF) Values
Sample TF-IDF Vectorized Matrix:
9. Feature Engineering
Truncated Singular Value Decomposition (TSVD)
Reduced number of features without compromising predictive performance
11,228 features --> 2,000 features
No reduction in F-1 Score or Accuracy Score
Models with fewer than 2,000 features experienced diminished performance
Trend observed across each model form
10. Parameter Tuning: Using Grid Search
● Optimized ‘C’ Value, the penalty parameter
● Maintained generalizability of model to prediction data
http://www.intechopen.com/source/html/45102/media/image44.png
16. Discussion
Results don’t match ideological spectrum of audiences.
Several potential interpretations:
Republican stories dominated news cycles
Republican candidates more regularly used pre-existing media language
Oral language is not strongly predictive of written language
17. Methodological Self-Evaluation (1)
● Strengths:
○ Expanded the instance set to reduce model performance variation
○ Removed moderator speech
○ Removed custom stop words
○ Employed a variety of model forms
○ Reduced feature set size without impeding performance
○ Optimized the ‘C’ parameter value
18. Methodological Self-Evaluation (2)
● Shortcomings:
○ RSS feed content was not always ideal or consistent
■ Contained ‘jQuery’ or advertisement placeholders
■ Variety in article length
■ Variable number of instances from each media outlet
○ Single source of training data
○ Uneven distribution of red/blue training data
19. Looking Towards Future Iterations
● Future studies could…
○ Use additional training data sources
○ Encompass prediction data of greater breadth and depth: more news sources and more articles per source
○ Include more feature engineering to account for differently formatted RSS feeds
○ Predict oral political dialogue
20. For Posterity
● Implications for partisanship...
○ The potential virtue of an ideologically balanced diet
○ A shift in media engagement behaviors could promote open-mindedness and compromise
○ This, in turn, could promote legislative functioning
Last weekend, President Obama delivered the commencement address at my alma mater, Rutgers University. In it, he alluded to the flood of information that we’ve become increasingly exposed to and the perhaps counterintuitive notion that it has not made us more informed. Instead, he noted, we use the web and social media as a tool to seek out information that reinforces our preexisting beliefs, tune out voices of those who don’t think like us, and amplify voices of those who do.
Indeed, America has become increasingly politically polarized during Obama’s tenure, and media consumption habits are thought to play a role in this emerging phenomenon.
In 2014, the Pew Research Center measured the ideological placement of the audiences of a variety of political news outlets. As you can see, political conservatives and liberals consume different news sources, and each outlet is believed to espouse and reinforce a particular philosophy within its readership.
If media consumption differentiation exists, it could presumably influence the political divisiveness that has manifested in, for example, gridlocked government -- the last two U.S. Congresses have been the two least productive historically. So political media content analysis is worthwhile.
Past studies have analyzed media content from a variety of angles, such as sentiment analysis, but we sought to evaluate media outlets based on their consistency with language spoken by politicians. Specifically, we asked: to what extent do media outlets’ written articles correspond to Democratic and Republican politicians’ word choices during the 2016 presidential primary debates? Does media language usage vary according to the same spectrum as the political preferences of their respective audiences? If so, could that suggest a link between language choice and political polarization?
Data Product: political language classifier for news articles.
High-level overview:
Build phase: generate model
Pull debate transcripts from Internet
Wrangle into text documents
Put them into the proper format for analysis
Fit classifier
Operational Phase:
Pull RSS feeds from the Internet
Put that data into the same form as our training data
Feed it to our model; receive one prediction per text document: “red” for Republican or “blue” for Democrat
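The two phases can be sketched as a minimal scikit-learn pipeline. The snippets and labels below are invented stand-ins for the actual debate transcripts and RSS articles; the key point is that the operational phase reuses the vectorizer fitted during the build phase:

```python
# Minimal sketch of the build + operational phases, using toy
# stand-in snippets rather than the actual debate corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Build phase: toy "transcript" documents with red/blue labels.
train_docs = [
    "cut taxes and shrink government regulation",
    "secure the border and strengthen the military",
    "expand healthcare coverage and raise the minimum wage",
    "address climate change and protect voting rights",
]
train_labels = ["red", "red", "blue", "blue"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
model = LinearSVC().fit(X_train, train_labels)

# Operational phase: transform new article text with the SAME
# fitted vectorizer, then receive one prediction per document.
articles = ["a plan to cut taxes and reduce regulation"]
predictions = model.predict(vectorizer.transform(articles))
```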
Now I’m going to drill down into the build phase, and then specifically into the initial data wrangling.
More depth for building our model:
Start with Debate HTML documents
Get them into text format, and into a data bunch
Conduct a type of analysis called TF-IDF, which is a weighted measure for word frequency in each document
Final data form: sparse matrix (I’ll go into more detail in a moment)
Feature engineering:
Remove stop words (“the” or “for”)
Remove non-predictive features
Data is in final form:
Evaluate three models: LR, SVM, and MNB
Iterative feature engineering and parameter tuning → fitted model
Drilling down into the initial data ingestion and wrangling:
Debate transcripts were HTML documents--ugly, with markup like ‘p’ and ‘body’ tags
Used BeautifulSoup to parse out the text, then spit out a document that only includes text
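A minimal sketch of that parsing step, using a stand-in HTML snippet rather than an actual transcript page:

```python
# Sketch of the HTML-to-text step with BeautifulSoup; the markup
# below is an invented stand-in for a debate-transcript page.
from bs4 import BeautifulSoup

html = """
<html><body>
  <p>COOPER: Welcome to the debate.</p>
  <p>We will fight for working families.</p>
  <script>var ad = 'placeholder';</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Drop script/style tags so only the spoken text survives.
for tag in soup(["script", "style"]):
    tag.decompose()
text = soup.get_text(separator="\n", strip=True)
```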
Data bunch format: particular directory structure compatible with scikit-learn modules
Vectorize data, transform it into weighted term frequency values, and remove stop words with one line of code
TF-IDF stands for Term Frequency-Inverse Document Frequency; we used scikit-learn to determine the weights for words
End result is a sparse matrix. Any given word appears relatively infrequently, so we have a lot of zeroes
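That one line of vectorization can be sketched as follows; the two toy documents stand in for the debate corpus:

```python
# Sketch: vectorize, weight by TF-IDF, and drop English stop
# words in a single step.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "we will cut taxes for the middle class",
    "we will expand access to affordable healthcare",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)  # scipy sparse matrix

# Most entries are zero: each document uses few of the corpus terms.
density = tfidf.nnz / (tfidf.shape[0] * tfidf.shape[1])
```

With stop words removed, only eight distinct terms survive here, and each document touches just four of them, which is why a sparse representation pays off.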
After getting data into proper format, we evaluated these three models:
The LR algorithm classifies data by obtaining a best-fit logistic function.
The MNB algorithm is a probabilistic classifier that applies Bayes’ theorem; it assumes (naively) that the features in the model are independent.
SVM classifies data by drawing a hyperplane that separates instances of different classes.
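A sketch of fitting all three model forms on a toy red/blue corpus. The snippets are invented stand-ins, and the scores here are training-set scores, not our reported results:

```python
# Sketch comparing the three model forms: Logistic Regression,
# Multinomial Naive Bayes, and a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

docs = [
    "lower taxes and less regulation",
    "strong military and secure borders",
    "affordable healthcare for every family",
    "action on climate change now",
]
labels = ["red", "red", "blue", "blue"]
X = TfidfVectorizer().fit_transform(docs)

models = {
    "LR": LogisticRegression(),
    "MNB": MultinomialNB(),
    "SVM": LinearSVC(),
}
# Fit each model form on the same TF-IDF matrix and record
# its training-set accuracy.
scores = {name: m.fit(X, labels).score(X, labels) for name, m in models.items()}
```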
Next, we wanted to make our model more efficient.
Moving forward with SVM as our best model, we used scikit-learn’s grid search to conduct parameter tuning. The model form we used was linear support vector classification, which has a penalty parameter called ‘C’ that controls how heavily the model penalizes misclassifications. We had to be careful not to overfit by choosing a C that simply minimized training errors, because the model would then be optimized solely for the training data. Instead, we chose a C that gave our linear model larger hyperplane margins and the best F-1 score across both Democratic and Republican data, putting it in the best position to predict against outside data.
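A minimal sketch of that grid search over ‘C’, again on an invented toy corpus:

```python
# Sketch of tuning LinearSVC's penalty parameter C with grid search.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

docs = [
    "cut taxes", "secure borders", "strong military", "less regulation",
    "expand healthcare", "climate action", "raise wages", "voting rights",
]
labels = ["red"] * 4 + ["blue"] * 4
X = TfidfVectorizer().fit_transform(docs)

# Smaller C -> wider margin (more regularization); larger C ->
# fewer training errors but a greater risk of overfitting.
grid = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]},
                    cv=2, scoring="f1_macro")
grid.fit(X, labels)
best_C = grid.best_params_["C"]
```

Scoring with a cross-validated F-1 average, rather than raw training accuracy, is what keeps the chosen C generalizable to outside data.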
Our optimized SVM model had an overall F-1 score of 0.83. The F1 score is a weighted average of the precision and recall, 1 being the best value and 0 being the worst. You can also see the broken out precision and recall for our Democratic and Republican data.
Accuracy rate is 84%
High precision and recall for Republican, but this could be attributed to the fact that we had twice as many Republican training instances as Democratic ones, owing to there having been more Republican debates than Democratic ones.
“A system with high recall but low precision returns many results, but most of its predicted labels are incorrect when compared to the training labels. A system with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the training labels.”
Confusion matrix
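A sketch of computing the confusion matrix and F-1 score with scikit-learn; the label vectors below are hypothetical, not our actual predictions:

```python
# Sketch of the evaluation step: F-1 score and a confusion
# matrix built from hypothetical red/blue predictions.
from sklearn.metrics import confusion_matrix, f1_score

y_true = ["red", "red", "red", "blue", "blue", "red"]
y_pred = ["red", "red", "blue", "blue", "blue", "red"]

# Rows are true labels, columns are predicted labels,
# in the order given: blue, then red.
cm = confusion_matrix(y_true, y_pred, labels=["blue", "red"])

# F-1 is the harmonic mean of precision and recall for a class.
f1 = f1_score(y_true, y_pred, pos_label="red")
```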
Goal: get RSS data into the same exact vector format as our training data, then feed it into our model
OPML (Outline processor markup language) documents: instructions for pulling specific RSS feeds →
Baleen: An automated ingestion service for blogs to construct a corpus for NLP research. →
Instantiate separate MongoDB database per news source. Documents are instances, and words are features.
html →
Transform into text →
Feed to model
Here is a graphic imported from Tableau that shows a normalized spectrum of our prediction results.
12 news sources; articles per source ranged from 27 to 173
As the slide states, 79% of the news articles that we analyzed had language that was more consistent with Republican rhetoric than Democratic rhetoric
While most sources fall within one standard deviation of the mean, the Washington Post and The Nation are outliers
Here’s another way of representing the information on the previous slide: a spectrum of absolute values, more uniformly Democratic on the left, more uniformly Republican on the right.
As Salil said, among all articles, 79% were classified as more consistent with Rep than Dem. Note that the majority of news sources are clustered together between 76% and 94% Red. You might see that MSNBC, often conventionally assumed to be left wing, is the furthest right. This probably didn’t conform with your expectations.
It also didn’t conform with the Pew Research findings. Here is a comparison of our results and the Pew findings. Although WAPO is about as far left as both scales go, most other sources display no meaningful relationship. So what does this mean?
Why DIDN’T our results match the ideological spectrum of audiences?
Parsing each debate into one document yielded a low sample size for our model, so we re-parsed our debate transcripts to yield one document per paragraph.
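The re-parsing step amounts to splitting each transcript into one document per paragraph; a sketch with a stand-in transcript:

```python
# Sketch of the re-parsing step: one document per paragraph
# instead of one per debate, enlarging the instance set.
# The transcript text below is an invented stand-in.
transcript = (
    "COOPER: Welcome to the debate.\n\n"
    "We need to cut taxes for families.\n\n"
    "Healthcare should be a right for all."
)

# Split on blank lines, dropping empty fragments.
paragraphs = [p.strip() for p in transcript.split("\n\n") if p.strip()]
```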
Next, we removed instances that contained moderators’ remarks - instance engineering
Created a list of custom stop words, added to scikit-learn’s original set, to further strengthen our training data: removal of candidates’ and moderators’ names
LR, MNB, and SVM - chosen because they were appropriate for the binary classification nature of our analysis
We reduced dimensionality by fitting our TF-IDF matrix to a Truncated Singular Value Decomposition model, scaling down to 2,000 features, the smallest size before which we observed gradual reductions in model performance.
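A sketch of that TSVD step; our real run reduced 11,228 features to 2,000, while this toy corpus is reduced to 2 components just to show the mechanics:

```python
# Sketch of Truncated SVD dimensionality reduction on a
# TF-IDF matrix built from invented stand-in documents.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "cut taxes for working families",
    "secure the border now",
    "healthcare is a right",
    "act on climate change",
]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Unlike PCA, TruncatedSVD works directly on sparse matrices.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)  # dense (n_docs, 2) array
```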
The ‘C’ value represents the misclassification penalty parameter, which we tuned so our model wasn’t overly optimized on its ability to fit the training data.
After transforming the HTML we pulled from RSS feeds, we discovered documents with jQuery script tags in addition to journalistic content.
Other transformed documents contained solely advertisements or placeholder HTML tags for advertisements.
Further, different news sources produced different kinds of RSS feeds: some were long-form with in-depth analysis, while others contained only blurbs that set up a resulting slideshow (not used in our data).
Other shortcomings include that we used only debate transcripts as our source of training data and that we had far more Republican debates (and candidates) than Democratic ones.
First, future studies could include more news sources and many more articles per news source.
Second, they could even out the distribution of Republican and Democratic speech in the training set.
Third, they could improve feature engineering, specifically regarding transforming data from its organic form into text documents and vectors.
Instead of relying solely on debate transcripts for the training data corpus, a future study could use debate transcripts to fit an initial model, then use that model to make predictions about a cross-section of article data, then feed the labeled article data back into the fitted model to strengthen and generalize it.
So going back to the conundrum that Paul outlined at the beginning of the presentation, it’s important to consider what implications a text classifier built to identify partisan-leaning language can have for individual news consumption. If people choose the news they read to reinforce their pre-existing beliefs, then it’s worth examining the potential virtue of an ideologically balanced diet. By being conscientious in our news consumption, we could see a shift toward media engagement behaviors that are more open-minded and less entrenched in ideology. Becoming open to compromise and working with the other side could, in turn, promote legislative functioning.