The Grammar of Truth and Lies
Using NLP to detect Fake News
Peter J Bleackley
Playful Technology Limited
peter.bleackley@playfultechnology.co.uk
The Problem
●
“A lie can run around the world before the truth can get its
boots on.”
●
Fake News spreads six times faster than real news on Twitter
●
The spread of true and false news online, Soroush Vosoughi,
Deb Roy, Sinan Aral, Science, Vol. 359, Issue 6380, pp.
1146-1151, 9th March 2018
●
https://science.sciencemag.org/content/359/6380/1146
The Data
●
“Getting Real about Fake News” Kaggle Dataset
●
https://www.kaggle.com/mrisdal/fake-news
●
12999 articles from sites flagged as unreliable by the BS Detector
chrome extension
●
Reuters-21578, Distribution 1.0 Corpus
●
10000 articles from Reuters Newswire, 1987
●
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
●
Available from NLTK
Don’t Use Vocabulary!
●
Potential for bias, especially as corpora are from different
time periods
●
Difficult to generalise
●
Could be reverse-engineered by a bad actor
Sentence structure features
●
Perform Part of Speech tagging with TextBlob
●
Concatenate tags to form a feature for each sentence
●
“Pete Bleackley is a self-employed data scientist and
computational linguist.”
●
'NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN'
●
Very large, very sparse feature set
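The tag-and-concatenate step can be sketched as follows. The tag list is the TextBlob output for the example sentence above (in practice it comes from `TextBlob(text).sentences[i].tags`, which yields (word, tag) pairs); it is inlined here so the sketch runs without the tagger's corpus downloads.

```python
def structure_feature(tags):
    """Concatenate the POS tags of one sentence into a single feature string."""
    return "_".join(tags)

# Tags for the example sentence, as produced by TextBlob's POS tagger.
tags = ["NNP", "NNP", "VBZ", "DT", "JJ", "NNS", "NN", "CC", "JJ", "NN"]
print(structure_feature(tags))  # NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN
```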
First model
●
Train LSI model (Gensim) on sentence structure features
from whole dataset
●
70/30 split between training and test data
●
Sentence structure features => LSI => Logistic Regression
(scikit-learn)
●
https://www.kaggle.com/petebleackley/the-grammar-of-truth-and-lies
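A minimal sketch of this pipeline, with made-up feature strings and labels. The talk uses Gensim's LsiModel; scikit-learn's TruncatedSVD stands in for it here (LSI is a truncated SVD of the term-document matrix), keeping the whole sketch in one library.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Toy documents: each string is one article's sentence-structure
# features, space-separated. These are invented for illustration.
docs = [
    "NNP_VBZ_DT_NN PRP_VBD_JJ",
    "NNP_VBZ_DT_NN DT_NN_VBZ_JJ",
    "PRP_VBD_JJ NNP_VBZ_DT_NN",
    "UH_PRP_VBP_JJ UH_DT_NN",
    "UH_DT_NN PRP_VBP_RB_JJ",
    "UH_PRP_VBP_JJ PRP_VBP_RB_JJ",
]
labels = [0, 0, 0, 1, 1, 1]  # 0 = real, 1 = fake

model = Pipeline([
    ("counts", CountVectorizer(token_pattern=r"\S+")),  # one token per feature
    ("lsi", TruncatedSVD(n_components=2, random_state=0)),
    ("clf", LogisticRegression()),
])
model.fit(docs, labels)
print(model.predict(docs))
```

In the talk the split is 70/30 train/test; here the model is fitted and applied to the same toy data purely to show the plumbing.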
Sentiment analysis
●
Used VADER model in NLTK
●
Produces Positive, Negative and Neutral scores for each
sentence
●
Sum over document
●
Precision 71%, Recall 88%, Accuracy 79%, Matthews 59%
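The per-document summation can be sketched like this. The per-sentence score dicts are made up for illustration; in practice each one comes from NLTK's `SentimentIntensityAnalyzer().polarity_scores(sentence)`.

```python
def document_sentiment(sentence_scores):
    """Sum per-sentence VADER pos/neg/neu scores over a document."""
    totals = {"pos": 0.0, "neg": 0.0, "neu": 0.0}
    for scores in sentence_scores:
        for key in totals:
            totals[key] += scores[key]
    return totals

# Invented per-sentence scores standing in for VADER output.
scores = [
    {"pos": 0.5, "neg": 0.25, "neu": 0.25},
    {"pos": 0.25, "neg": 0.5, "neu": 0.25},
]
print(document_sentiment(scores))  # {'pos': 0.75, 'neg': 0.75, 'neu': 0.5}
```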
Sentence Structure + Sentiments
●
Precision 74%
●
Recall 90%
●
Accuracy 81%
●
Matthews 64%
●
Slight improvement, but it looks like sentiment is doing
most of the work
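The four metrics reported on these slides can all be computed with scikit-learn; the labels and predictions below are made up purely to show the calls.

```python
from sklearn.metrics import (precision_score, recall_score,
                             accuracy_score, matthews_corrcoef)

# Invented ground truth and predictions (1 = fake, 0 = real).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

print(precision_score(y_true, y_pred))    # 0.8
print(recall_score(y_true, y_pred))       # 0.8
print(accuracy_score(y_true, y_pred))     # 0.75
print(matthews_corrcoef(y_true, y_pred))
```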
Understanding the models
●
Out of 333264 sentence structure features, 298332 occur
only in a single document
●
Out of 23000 documents, 11276 have no features in
common with others
●
We need some denser features
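The sparsity check itself is a one-pass count over the feature sets; the tiny per-document feature sets below are invented for illustration.

```python
from collections import Counter

# Each set holds the sentence-structure features found in one document.
doc_features = [
    {"NNP_VBZ_DT_NN", "PRP_VBD_NN"},
    {"NNP_VBZ_DT_NN"},
    {"DT_NN_VBZ_JJ"},
]
counts = Counter(f for feats in doc_features for f in feats)
singletons = [f for f, c in counts.items() if c == 1]
print(len(singletons))  # features occurring in only one document
```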
Function words
●
Pronouns, prepositions, conjunctions, auxiliaries
●
Present in every document – most common words
●
Usually discarded as “stopwords”...
●
...but useful for stylometric analysis, e.g. authorship
attribution
●
NLTK stopwords corpus
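Counting function words per document can be sketched like this; a small hand-picked set is inlined so the sketch is self-contained, standing in for NLTK's stopwords corpus (`nltk.corpus.stopwords.words('english')`).

```python
from collections import Counter

# A small illustrative subset of English function words.
FUNCTION_WORDS = {"the", "a", "of", "and", "in", "to", "is", "that", "it", "for"}

def function_word_counts(text):
    """Count occurrences of function words in one document."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t in FUNCTION_WORDS)

counts = function_word_counts("The truth is in the details and it matters")
print(counts)  # the: 2; is, in, and, it: 1 each
```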
New model
●
Sentence structure features + function words => LSI =>
Logistic Regression
●
Precision 90%
●
Recall 96%
●
Accuracy 93%
●
Matthews 87%
What have we learnt?
●
Grammatical and stylistic features can be used to
distinguish between real and fake news
●
Good choice of features is the key to success
●
Will this generalise to other sources?
See also...
●
The (mis)informed citizen
●
Alan Turing Institute project
●
https://www.turing.ac.uk/research/research-projects/misinformed-citizen