The document describes a project that aims to measure the interestingness of articles by analyzing tweets related to entities within the articles. It extracts entities from articles, identifies dominant entities, mines tweets related to those entities, categorizes tweets as positive, negative, or neutral, and predicts interestingness based on the number of positive and negative tweets. Challenges include collecting a suitable test article set, choosing a Twitter dataset and parameters for tweet categorization, and determining the appropriate algorithm to measure interestingness based on positive and negative tweets. In conclusion, social media like Twitter can be leveraged to predict the nature of web data and suggest articles to users.
Predict Interestingness of An Article Using Twitter
1. Predict the Interestingness of an Article
Using Twitter
Chitra Khatwani
Yashasvi Girdhar
Khyati Chandu
R.K. Srinivas
2. The project aims at measuring the
interestingness of articles by analyzing the
tweets related to the entities in the article.
●
Application:
– Ordering the articles returned for a search query
according to their interestingness.
– Suggesting news articles to users on websites.
3. Approach Followed
●
Extract all the named entities from the article
> Two methods can be followed
●
Using NLTK Library
●
Using a list of Wikipedia titles
We used the second approach because the NLTK
library misses many important entities in some
cases.
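The title-list approach can be sketched as a longest-match lookup of word n-grams against a set of Wikipedia titles. This is only an illustration under stated assumptions: the function name `extract_entities`, the `max_len` window of four words, and the tiny in-memory `titles` set are hypothetical, and the slides do not say how matching was actually implemented.

```python
import re

def extract_entities(text, titles, max_len=4):
    """Return the titles found in text, preferring the longest match."""
    # Case-insensitive lookup that maps back to the canonical title spelling.
    lookup = {t.lower(): t for t in titles}
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    found = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the widest window first so a longer title wins over a shorter one.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in lookup:
                found.append(lookup[candidate])
                match_len = n
                break
        # Skip past a match, otherwise advance by one token.
        i += match_len if match_len else 1
    return found

# Hypothetical stand-in for the full list of Wikipedia titles.
titles = {"Barack Obama", "White House", "Twitter"}
text = "Barack Obama spoke at the White House; Obama later posted on Twitter."
entities = extract_entities(text, titles)
```

Note that the bare mention "Obama" is not matched here because only full titles are in the set; a real title list would also need redirects or aliases to catch such mentions.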
4. Approach Followed
●
Shortlist all the dominant entities from the
extracted entities
– Dominant entities are those that are most
frequently mentioned in the article.
– Methods:
●
Can be decided based on the frequency of entities
●
Entities occurring in the title of the article
5. Approach Followed
●
Mine all the tweets related to all the dominant
entities
●
Done using Twitter Search API
●
Tweets about each entity need to be collected from around the date
when the article was published.
●
Tweets need to be parsed before they are stored, to make
them ready for the next steps.
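The date filtering and parsing steps can be sketched without touching the Twitter Search API itself. The cleaning rules below (lowercasing, dropping URLs and @mentions, keeping hashtag words) and the three-day window are assumptions chosen for illustration; the slides do not specify them.

```python
import re
from datetime import date

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z0-9']+")

def within_window(tweet_date, published, days=3):
    """True if the tweet was posted within `days` of the article's date."""
    return abs((tweet_date - published).days) <= days

def parse_tweet(text):
    """Normalize a raw tweet into a list of lowercase unigram tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)      # drop links
    text = MENTION_RE.sub(" ", text)  # drop @user mentions
    text = text.replace("#", " ")     # keep the hashtag word, drop the symbol
    return TOKEN_RE.findall(text)

tokens = parse_tweet("Loving the new #Budget speech by @BarackObama! http://t.co/abc")
```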
6. Approach Followed
●
Categorize each tweet as +ve, -ve or neutral
– Weight all unigram tokens equally
– Score each token using the naive Bayes formula
– Sum up the scores of all the tokens to calculate the
score for an entity
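The scoring step can be illustrated with a small unigram naive Bayes sketch. It assumes the per-token score is the log ratio of Laplace-smoothed class likelihoods, that class priors are equal, and that a summed score within a hypothetical `threshold` of zero counts as neutral; none of these details are given in the slides, and the training tweets are made up.

```python
import math
from collections import Counter

def train(labelled):
    """Count token occurrences per class from (tokens, label) pairs."""
    counts = {"pos": Counter(), "neg": Counter()}
    for tokens, label in labelled:
        counts[label].update(tokens)
    return counts

def token_score(token, counts, vocab_size):
    """Log ratio of Laplace-smoothed P(token | pos) to P(token | neg)."""
    p_pos = (counts["pos"][token] + 1) / (sum(counts["pos"].values()) + vocab_size)
    p_neg = (counts["neg"][token] + 1) / (sum(counts["neg"].values()) + vocab_size)
    return math.log(p_pos / p_neg)

def classify(tokens, counts, threshold=0.5):
    """Sum the token scores and map the total to a sentiment label."""
    vocab_size = len(set(counts["pos"]) | set(counts["neg"]))
    score = sum(token_score(t, counts, vocab_size) for t in tokens)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

# Made-up training tweets, already tokenized.
model = train([
    (["great", "speech"], "pos"),
    (["love", "this", "great", "news"], "pos"),
    (["terrible", "speech"], "neg"),
    (["hate", "this", "bad", "news"], "neg"),
])
```

Tokens that appear equally often in both classes, or not at all, contribute a score of zero, so a tweet made only of such tokens is labelled neutral.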
7. Approach Followed
●
Predict the interestingness of the article, using
the number of positive and negative tweets
We followed the approach below:
– The smaller the difference between the number of positive
tweets and the number of negative tweets, the more
interesting the article.
– On the other hand, if the number of positive tweets
far outweighs the number of negative tweets, or vice
versa, the article is considered less interesting.
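The rule above can be turned into a number in many ways. The sketch below uses one possible normalization, one minus the absolute difference divided by the total count, so that a perfectly split reaction scores 1 and a one-sided reaction scores 0. The formula and the choice to return 0 when there are no tweets are assumptions made for illustration, not the scoring used in the project.

```python
def interestingness(num_pos, num_neg):
    """Score in [0, 1]: higher when positive and negative counts are balanced."""
    total = num_pos + num_neg
    if total == 0:
        # Assumption: an article with no opinionated tweets is not interesting.
        return 0.0
    return 1.0 - abs(num_pos - num_neg) / total

balanced = interestingness(50, 50)   # evenly split reaction
one_sided = interestingness(90, 10)  # mostly positive reaction
```

Dividing by the total keeps articles with very different tweet volumes comparable, which matters when ordering search results by this score.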
8. Datasets used
●
For Articles
– A set of random news articles taken from the BBC
News Dataset
●
For Sentiment Analysis
– Mejaj Dataset
●
Built by categorizing tweets against a
predefined list of positive and negative words
– Stanford Dataset
9. Challenges
●
Collecting the right set of articles for testing our
model
●
Finding the right Twitter dataset and then
deciding on the parameters used to categorize
each tweet
●
Choosing an appropriate algorithm for
measuring the interestingness of the article
based on the +ve and -ve tweets
10. Conclusion
●
Social media, such as Twitter in this case, is a
very common medium for people to express
their opinions nowadays.
This can be leveraged as a powerful signal
for predicting the nature of the data
published on the web, especially the millions of
articles that are published each day.
It can also be used to suggest articles
to users.
References
●
Mining Sentiments from Tweets, Siel, IIIT-Hyderabad