The document summarizes the thesis defense of Prateek Dewan on techniques for automating quality assessment of context-specific content on online social media. The thesis designs and evaluates automated techniques to assess the quality of content on Facebook in real time, focusing on identifying poor quality posts, pages publishing poor quality content, and misinformation spread through images. The approach characterizes ground truth, builds supervised machine learning models, and implements a system called Facebook Inspector. Key findings include that 65% of poor quality posts remained on Facebook for over 4 months, and that images were found to contain more negative sentiment than their accompanying text.
2. Who am I?
• Data Scientist at Apple
• PhD student since February, 2012 – IIIT-Delhi
• Masters (2010 – 2012), IIIT-Delhi
• Collaborations
• IBM IRL (Delhi and Bengaluru), Symantec Research Labs (Pune), Dublin City
University (Ireland), UFMG (Brazil)
• Worked in Privacy and Security on Online Social Media
• Research interests
• Applied Machine Learning
• Natural Language Processing
• Web Security
9. Approach
Characterize
• Poor quality posts published on Facebook
• Facebook pages publishing poor quality content
• Misinformation spread on Facebook through images
Model
• Ground truth extraction using URL blacklists and human annotation
• Experiments with multiple supervised learning techniques
• Two-fold model to identify malicious content in real time
Implement
• Facebook Inspector (FbI) architecture
• Live deployment via REST API and browser plug-ins for Chrome and Firefox
• 3,000+ downloads, 180+ daily active users, 1 million+ posts analyzed
• Evaluation in terms of response time, performance, and usability
10. Approach (recap): focus on poor quality posts published on Facebook
11. Dataset

Data type                                             Quantity
Unique posts                                          4,465,371
Unique entities                                       3,373,953
Unique users                                          2,983,707
Unique pages                                          390,246
Unique URLs                                           480,407
Unique posts with one or more URLs                    1,222,137
Unique entities posting URLs                          856,758
Unique posts with one or more malicious URLs          11,217
Unique entities posting one or more malicious URLs    7,962
Unique malicious URLs                                 4,622
17. Approach (recap): focus on Facebook pages publishing poor quality content
20. Dataset of pages posting poor quality content

WOT response     No. of pages   No. of posts
Child unsafe     387            10,891
Untrustworthy    317            8,057
Questionable     312            8,859
Negative         266            5,863
Adult content    162            3,290
Spam             124            4,985
Phishing         39             495
Total            627 (31)       20,999

• Numbers in brackets are Verified pages
21. Content analysis (page names)
• Sentence tokenization → Word tokenization → Case normalization → Stemming → Stopword removal
• N-gram analysis (n = 1, 2, 3)
• Politically polarized entities amongst poor quality pages
• British National Party (BNP), The Tea Party, English Defense League, American Defense League, American Conservatives, Geert Wilders supporters…
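The preprocessing and n-gram pipeline above can be sketched with the standard library alone. This is an illustrative sketch, not the thesis code: a real pipeline would use an NLP toolkit (e.g. NLTK) for tokenization and stemming; the tiny stopword list and the suffix-stripping "stemmer" here are stand-ins, and the page names are drawn from the examples on this slide.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "for"}  # small illustrative list


def preprocess(name):
    """Tokenize, lowercase, crudely stem, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", name.lower())
    # Stand-in for a real stemmer: strip a trailing plural 's' on longer words.
    tokens = [t.rstrip("s") if len(t) > 3 else t for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


names = ["The Tea Party", "English Defense League", "American Defense League"]
counts = Counter()
for name in names:
    toks = preprocess(name)
    for n in (1, 2, 3):
        counts.update(ngrams(toks, n))

# The shared bigram ("defense", "league") surfaces the common theme
# across several of the politically polarized page names.
print(counts[("defense", "league")])  # 2
```

Ranking the resulting n-gram counts is what surfaces recurring themes (here, "defense league") across thousands of page names without reading them manually.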
24. Approach (recap): focus on misinformation spread on Facebook through images
26. Are we doing enough to "understand" images?
• Most research to analyze social media content focuses on text
• Topic modelling
• Sentiment analysis
• Does it capture everything?
• Studies related to images are limited in scale
• A few hundred images manually annotated and analyzed
• What can be done?
• Automated techniques for image summarization; Deep Learning and Convolutional Neural Networks (CNNs) to scale across large numbers of images
• Domain transfer learning
• Optical Character Recognition
29. Tier I: Visual Themes contd.
• All images labeled using Inception-v3
• Validation:
• Random sample of 2,545 images annotated by 3 human annotators
• 38.87% accuracy (majority voting)
• Manual calibration
• Renamed 7 out of the top 30 (most frequently occurring) labels
• New accuracy: 51.3%
• Why rename? Example: the Inception-v3 label "Bolo Tie" corresponds to the "PeaceForParis" symbol in our dataset
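The majority-voting validation above can be sketched as follows: a model label is counted correct when it matches the label a majority of the three annotators chose. The labels below are hypothetical, chosen only to illustrate the computation.

```python
from collections import Counter


def majority_accuracy(model_labels, annotations):
    """Accuracy of model labels against the majority vote of annotators."""
    correct = 0
    for model, votes in zip(model_labels, annotations):
        majority, _ = Counter(votes).most_common(1)[0]
        correct += (model == majority)
    return correct / len(model_labels)


# Hypothetical labels for 4 images, each judged by 3 annotators
model = ["flag", "crowd", "flag", "tower"]
human = [["flag", "flag", "crowd"],
         ["crowd", "crowd", "crowd"],
         ["tower", "flag", "tower"],
         ["tower", "tower", "sky"]]

print(majority_accuracy(model, human))  # 0.75
```

Manual calibration then amounts to renaming a model label (e.g. "Bolo Tie" → "PeaceForParis") before rerunning this comparison, which is how the reported accuracy rose from 38.87% to 51.3%.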
31. Tier III: Text embedded in images
• Optical Character Recognition (OCR)
• Tesseract OCR (Python)
• 31,689 images had text
• Manually extracted text from a random sample of 1,000 images
• Compared with OCR output using string similarity metrics
• ~62% accuracy

Tesseract output (example): "No-one thinks that these people are representative of Christians. So why do so many think that these people are representative of Muslims?"
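The OCR-validation step (comparing Tesseract output against a manual transcription via string similarity) can be sketched with the standard library's difflib. The actual Tesseract call is shown only in a comment, and the example strings, including the simulated OCR error, are hypothetical stand-ins.

```python
from difflib import SequenceMatcher


def ocr_similarity(ocr_text, ground_truth):
    """Similarity in [0, 1] between OCR output and a manual transcription,
    after normalizing case and whitespace."""
    def norm(s):
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(ocr_text), norm(ground_truth)).ratio()


# In the real pipeline the OCR text would come from Tesseract, e.g.:
#   import pytesseract
#   from PIL import Image
#   ocr_text = pytesseract.image_to_string(Image.open("post.jpg"))
ocr_text = "No-one thinks that these peop1e are representative of Christians."
truth = "No-one thinks that these people are representative of Christians."

score = ocr_similarity(ocr_text, truth)
print(round(score, 2))
```

Averaging such per-image scores over the 1,000 manually transcribed samples is one way to arrive at an aggregate accuracy figure like the ~62% reported above.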
35. Approach (recap)
36. Revisiting: Establishing Ground Truth
• Extracted posts containing one or more URLs
• 1.2 million out of 4.4 million posts in total
• 480k unique URLs
• Used six URL blacklists
• Google Safe Browsing (malware / phishing)
• VirusTotal (spam / malware / phishing)
• SURBL (spam)
• Web of Trust (trust score)*
• Spamhaus (spam)
• PhishTank (phishing)
• Posts containing one or more blacklisted URLs were marked as poor quality posts (11,217 in all)
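The blacklist-based labeling rule can be sketched as below. The in-memory set and the example posts are hypothetical stand-ins for the six live blacklist lookups (Google Safe Browsing, VirusTotal, SURBL, Web of Trust, Spamhaus, PhishTank).

```python
import re

# Hypothetical stand-in for the six blacklist services' combined verdicts
BLACKLISTED_HOSTS = {"evil.example.com", "phish.example.net"}

# Capture the host part of each http(s) URL in a post
URL_RE = re.compile(r"https?://([^/\s]+)")


def is_poor_quality(post_text):
    """A post is marked poor quality if any URL it contains is blacklisted."""
    return any(host in BLACKLISTED_HOSTS for host in URL_RE.findall(post_text))


posts = [
    "Free iPhones here http://evil.example.com/win",
    "Great article: http://news.example.org/story",
]
labels = [is_poor_quality(p) for p in posts]
print(labels)  # [True, False]
```

Applied to the 1.2 million URL-bearing posts, this rule is what yields the 11,217 poor quality posts used as ground truth.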
37. Ground Truth extraction – Dataset II
• What if a post does not have a URL?
• 500 random Facebook posts × 17 events × 3 annotators
• Definition of malicious post
• "Any irrelevant or unsolicited messages sent over the Internet, typically to large numbers of users, for the purposes of advertising, phishing, spreading malware, etc. are categorized as spam. In terms of online social media, social spam is any content which is irrelevant / unrelated to the event under consideration, and / or aimed at spreading phishing, malware, advertisements, self promotion etc., including bulk messages, profanity, insults, hate speech, malicious links, fraudulent reviews, scams, fake information etc."
• Final dataset (all 3 annotators agreed on the same label)
• 571 malicious posts
• 3,841 benign posts
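The unanimous-agreement filter used to build the final dataset can be sketched as follows; the votes below are hypothetical.

```python
def unanimous(labels_per_post):
    """Keep only posts where every annotator assigned the same label."""
    return [votes[0] for votes in labels_per_post if len(set(votes)) == 1]


votes = [
    ["malicious", "malicious", "malicious"],  # kept: full agreement
    ["malicious", "benign", "malicious"],     # dropped: annotators disagree
    ["benign", "benign", "benign"],           # kept: full agreement
]
final = unanimous(votes)
print(final)  # ['malicious', 'benign']
```

Requiring full agreement (rather than a 2-of-3 majority) trades dataset size for label quality, which is why only 571 + 3,841 of the annotated posts survive into the final dataset.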
46. Approach (recap)
48. FbI stats

Date of public launch          August 23, 2015
Total incoming requests        9 million+
Total public posts analyzed    3.5 million+
Total downloads                5,000+
Daily active users             250+
Total unique browsers          1,250+
Posts marked as malicious      615,000+
Posts marked as benign         2.9 million+
50. FbI evaluation: Usability
• Usability study with 53 participants
• SUS score: 81.36 (A grade)
• Higher perceived usability than > 90% of all systems evaluated on the SUS scale
• 98.1% of participants found FbI "easy to use"
• 67.9% of participants would like to use FbI frequently
• Quotes from users:
• "Saves your time spent on spam links and hence enhances user experience."
• "[Facebook Inspector] Can be useful for minors and people who lack the judgement to decide how the post is."
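A SUS score like the 81.36 above is computed per participant with the standard SUS formula, then averaged. A sketch of the per-participant scoring (the responses below are hypothetical):

```python
def sus_score(responses):
    """Standard SUS scoring for one participant's ten 1-5 responses:
    odd-numbered items contribute (response - 1), even-numbered items
    contribute (5 - response); the sum is scaled by 2.5 to 0-100."""
    assert len(responses) == 10
    total = sum((r - 1) if i % 2 == 0 else (5 - r)
                for i, r in enumerate(responses))
    return total * 2.5


# One hypothetical participant's responses to the 10 SUS items
print(sus_score([5, 1, 5, 2, 4, 1, 5, 1, 4, 2]))  # 90.0
```

Averaging this score over all 53 participants gives the study-level figure; 81.36 places FbI in the "A" grade band of the SUS scale.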
55. Acknowledgements
• NIXI for travel support (eCRS, 2014)
• IIIT-Delhi for travel support (ASONAM, 2017)
• Govt. of India for funding during PhD
• Collaborators and co-authors: Dr. Anand Kashyap, Shrey Bagroy,
Anshuman Suri, Varun Bharadhwaj, Aditi Mithal
• Monitoring committee: Dr. Vinayak and Dr. Sambuddho
• Peers: Dr. Niharika Sachdeva, Anupama Aggarwal, Dr. Paridhi Jain,
Dr. Aditi Gupta, Srishti Gupta, Rishabh Kaushal
• Members of Precog@IIITD and CERC
• Everyone else who has been part of my journey…
56. Publications – Part of thesis
• Dewan, P., Bagroy, S., and Kumaraguru, P.
Hiding in Plain Sight: The Anatomy of Malicious Pages on Facebook.
Book chapter, Lecture Notes in Social Networks, Springer 2017 (To appear)
• Dewan, P., Suri, A., Bharadhwaj, V., Mithal, A., and Kumaraguru, P.
Towards Understanding Crisis Events On Online Social Networks Through Pictures.
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
(ASONAM), 2017.
• Dewan, P., and Kumaraguru, P.
Facebook Inspector (FbI): Towards Automatic Real Time Detection of Malicious Content on
Facebook.
Social Network Analysis and Mining Journal (SNAM), 2017. Volume 7, Issue 1.
• Dewan, P., Bagroy, S., and Kumaraguru, P.
Hiding in Plain Sight: Characterizing and Detecting Malicious Facebook Pages.
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
(ASONAM), 2016 (Short paper)
• Dewan, P., and Kumaraguru, P.
Towards Automatic Real Time Identification of Malicious Posts on Facebook.
Thirteenth Annual Conference on Privacy, Security and Trust (PST), 2015
• Dewan, P., Kashyap, A., and Kumaraguru, P.
Analyzing Social and Stylometric Features to Identify Spear phishing Emails.
APWG eCrime Research Symposium (eCRS), 2014
57. Publications – Other
• Kaushal, R., Chandok, S., Jain P., Dewan, P., Gupta, N., and Kumaraguru, P.
Nudging Nemo: Helping Users Control Linkability across Social Networks.
9th International Conference on Social Informatics (SocInfo), 2017 (Short paper).
• Deshpande, P., Joshi, S., Dewan, P., Murthy, K., Mohania, M., Agrawal, S.
The Mask of ZoRRo: preventing information leakage from documents.
Knowledge and Information Systems Journal, 2014
• Mittal, S., Gupta, N., Dewan, P., Kumaraguru, P.
Pinned it! A large scale study of the Pinterest network.
1st ACM IKDD Conference on Data Sciences (CoDS), 2014
• Dewan, P., Gupta, M., Goyal, K., and Kumaraguru, P.
MultiOSN: Realtime Monitoring of Real World Events on Multiple Online Social Media
IBM ICARE 2013
• Magalhães, T., Dewan, P., Kumaraguru, P., Melo-Minardi, R., and Almeida, V.
uTrack: Track Yourself! Monitoring Information on Online Social Media.
22nd International World Wide Web Conference (WWW), 2013
• Conway, M., Dewan, P., Kumaraguru, P., and McInerney, L.
'White Pride Worldwide': A Meta-analysis of Stormfront.org
Internet, Politics, Policy 2012: Big Data, Big Challenges?, Oxford Internet Institute, University of Oxford.