Anúncio

Grammar of truth and lies

20 de Feb de 2020
Anúncio

Mais conteúdo relacionado

Similar a Grammar of truth and lies(20)

Anúncio

Grammar of truth and lies

  1. The Grammar of Truth and Lies Using NLP to detect Fake News Peter J Bleackley Playful Technology Limited peter.bleackley@playfultechnology.co.uk
  2. The Problem ● “A lie can run around the world before the truth can get its boots on.” ● Fake News spreads six times faster than real news on Twitter ● The spread of true and false news online, Sorush Vosougi, Deb Roy, Sinan Aral, Science, Vol. 359, Issue 6380, pp. 1146-1151, 9th March 2018 ● https://science.sciencemag.org/content/359/6380/1146
  3. The Data ● “Getting Real about Fake News” Kaggle Dataset ● https://www.kaggle.com/mrisdal/fake-news ● 12999 articles from sites flagged as unreliable by the BS Detector chrome extension ● Reuters-21578, Distribution 1.0 Corpus ● 10000 articles from Reuters Newswire, 1987 ● http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html ● Available from NLTK
  4. Don’t Use Vocabulary! ● Potential for bias, especially as corpora are from different time periods ● Difficult to generalise ● Could be reverse-engineered by a bad actor
  5. Sentence structure features ● Perform Part of Speech tagging with TextBlob ● Concatenate tags to form a feature for each sentence ● “Pete Bleackley is a self-employed data scientist and computational linguist.” ● 'NNP_NNP_VBZ_DT_JJ_NNS_NN_CC_JJ_NN' ● Very large, very sparse feature set
  6. First model ● Train LSI model (Gensim) on sentence structure features from whole dataset ● 70/30 split between training and test data ● Sentence structure features => LSI => Logistic Regression (scikit-learn) ● https://www.kaggle.com/petebleackley/the-grammar-of-trut h-and-lies
  7. Performance ● Precision 61% ● Recall 96% ● Accuracy 70% ● Matthews Correlation Coefficient 50% ● Precision measures our ability to catch the bad guys.
  8. Sentiment analysis ● Used VADER model in NLTK ● Produces Positive, Negative and Neutral scores for each sentence ● Sum over document ● Precision 71%, Recall 88%, Accuracy 79%, Matthews 59%
  9. Sentence Structure + Sentiments ● Precision 74% ● Recall 90% ● Accuracy 81% ● Matthews 64% ● Slight improvement, but it looks like sentiment is doing most of the work
  10. Random Forests Precision Recall Accuracy Matthews Sentence structure 83% 89% 86% 71% Sentiments 75% 75% 78% 76% Both 84% 89% 87% 76%
  11. Understanding the models ● Out of 333264 sentence structure features, 298332 occur only in a single document ● Out of 23000 documents, 11276 have no features in common with others ● We need some denser features
  12. Function words ● Pronouns, prepositions, conjunctions, auxilliaries ● Present in every document – most common words ● Usually discarded as “stopwords”... ● ...but useful for stylometric analysis, eg document attribution ● NLTK stopwords corpus
  13. New model ● Sentence structure features + function words => LSI => Logistic Regression ● Precision 90% ● Recall 96% ● Accuracy 93% ● Matthews 87%
  14. What have we learnt? ● Grammatical and stylistic features can be used to distinguish between real and fake news ● Good choice of features is the key to success ● Will this generalise to other sources?
  15. See also... ● The (mis)informed citizen ● Alan Turing Institute project ● https://www.turing.ac.uk/research/research-projects/misinf ormed-citizen
Anúncio