This document describes WarningBird, a system for detecting suspicious URLs in Twitter streams. It uses URL redirection chains and tweet context to extract 11 features for classifying URLs. WarningBird crawls URLs to obtain redirection chains, then performs domain grouping, feature extraction and logistic regression classification. It achieved low false positive and negative rates on real Twitter data. WarningBird can process URLs quickly and detect suspicious accounts faster than Twitter's own systems.
1. WarningBird: Detecting Suspicious URLs in Twitter Stream
Sangho Lee and Jong Kim
Pohang University of Science and Technology
January 18, 2012
5. Threat
Post URLs to attract traffic to website
Can deliver various payloads
Spam
Phishing
Malicious software download
7. Twitter
Online micro-blogging service
Large (about 100 million accounts)
URL shortener services
Tweets broadcast to legitimate users
Good vector for attackers to attract traffic
Many potential targets
URL shorteners are common and mask the actual website
Many users view tweets based on content, not authorship
10. Existing Detection Approaches and Limitations
1. Detect accounts based on account information
E.g., ratio of Tweets with URLs to Tweets without URLs
Easily fabricated by attacker
2. Detect accounts based on social graph
E.g., connectivity measures for each node
Hard to obtain and analyze large amounts of Twitter data
3. Crawl URLs to classify them
E.g., detect malicious URLs based on HTML content
Evaded by attackers who use redirection chains
11. Redirection Chains
Redirect chains start by resolving the shortened URL
Several hops through attacker-owned URLs redirect the user
Attacker dynamically chooses which page a visitor ultimately reaches
Crawlers go to a legitimate URL
Legitimate users go to the malicious URL
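The conditional redirection described on this slide can be illustrated with a small simulation. This is only a sketch: the redirect table, the example.* URLs, and the is_crawler flag are all hypothetical, and a real attacker would decide based on signals such as IP address or user agent rather than an explicit flag.

```python
from typing import Callable, Dict, List

# Hypothetical redirect table: each URL maps to a function that picks the
# next hop for a given visitor. URLs absent from the table are landing pages.
Redirector = Callable[[bool], str]

REDIRECTS: Dict[str, Redirector] = {
    "http://short.example/abc": lambda is_crawler: "http://hop1.example/r",
    "http://hop1.example/r": lambda is_crawler: "http://hop2.example/r",
    # The last attacker-controlled hop decides where the visitor lands.
    "http://hop2.example/r": lambda is_crawler: (
        "http://benign.example/" if is_crawler else "http://malicious.example/"
    ),
}


def follow_chain(url: str, is_crawler: bool, max_hops: int = 20) -> List[str]:
    """Return every URL visited, starting from the shortened URL."""
    chain = [url]
    while chain[-1] in REDIRECTS and len(chain) <= max_hops:
        chain.append(REDIRECTS[chain[-1]](is_crawler))
    return chain


crawler_chain = follow_chain("http://short.example/abc", is_crawler=True)
user_chain = follow_chain("http://short.example/abc", is_crawler=False)
```

Both chains share the same first three hops and differ only in the landing page, which is why the part of the chain visible to a crawler still carries useful signal.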
14. Problem
Given a URL posted on Twitter, determine whether a legitimate user would ultimately be directed to a malicious URL by visiting the URL on Twitter
Assumptions:
Cannot use features easily fabricated by attacker
No access to large Twitter graph
Have access to part of redirect chain available to crawlers
Redirect chains cannot be fabricated
Solution Overview:
Create classifier
Rely on redirect chain for features
Validate accuracy/performance with Twitter data
16. Data Collection
Use Twitter Streaming API to collect Tweets
Keep only Tweets with URLs
Crawl and store URL chain of each URL
Queue many Tweets to be analyzed together
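A minimal sketch of this collection step, assuming tweets arrive as dictionaries with "user" and "text" keys. The Streaming API client and the crawler are replaced by hypothetical stand-ins (a sample list and stub_crawl), and WINDOW_SIZE is an illustrative value, not one taken from the slides.

```python
import re
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List

URL_PATTERN = re.compile(r"https?://\S+")
WINDOW_SIZE = 1000  # hypothetical queue length


def stub_crawl(url: str) -> List[str]:
    # Placeholder for a real crawler: pretend every URL has one extra hop.
    return [url, url + "/landing"]


def collect(
    tweets: Iterable[Dict[str, str]],
    crawl: Callable[[str], List[str]],
    window: Deque[Dict[str, object]],
) -> None:
    """Keep only tweets with URLs and queue them with their crawled chains."""
    for tweet in tweets:
        urls = URL_PATTERN.findall(tweet["text"])
        if not urls:
            continue  # tweets without URLs are dropped
        chains = [crawl(url) for url in urls]
        window.append({"user": tweet["user"], "chains": chains})


# The deque discards the oldest entry once WINDOW_SIZE entries are queued.
window: Deque[Dict[str, object]] = deque(maxlen=WINDOW_SIZE)
sample = [
    {"user": "a", "text": "hello world"},
    {"user": "b", "text": "look http://t.example/1"},
    {"user": "c", "text": "two http://t.example/2 and https://t.example/3"},
]
collect(sample, stub_crawl, window)
```

Queuing many tweets before analysis lets later stages compare chains against each other instead of judging each URL in isolation.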
17. Feature Extraction
Group domains that resolve to the same IP address (e.g., xyz.com = 20.30.40.50 = abc.com)
Find entry point URLs
11 features based on URL chains and Tweet context
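The grouping and entry point steps might look like the sketch below. It makes two assumptions that the slides do not state: domains are grouped by replacing the hostname with its resolved IP address, and an entry point is taken to be the grouped URL that appears in the most chains. The DNS table and the example chains are hypothetical.

```python
from collections import Counter
from typing import Dict, List
from urllib.parse import urlsplit

# Hypothetical DNS results, mirroring the slide's example.
DOMAIN_TO_IP: Dict[str, str] = {
    "xyz.com": "20.30.40.50",
    "abc.com": "20.30.40.50",
}


def group_domain(url: str, domain_to_ip: Dict[str, str]) -> str:
    """Replace the hostname with its IP so co-hosted domains look identical."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return domain_to_ip.get(host, host) + parts.path


def entry_point(chains: List[List[str]], domain_to_ip: Dict[str, str]) -> str:
    """Return the grouped URL shared by the largest number of chains."""
    counts: Counter = Counter()
    for chain in chains:
        # A set ensures each grouped URL is counted once per chain.
        counts.update({group_domain(url, domain_to_ip) for url in chain})
    return counts.most_common(1)[0][0]


chains = [
    ["http://s1.example/x", "http://xyz.com/a", "http://land1.example/"],
    ["http://s2.example/y", "http://abc.com/a", "http://land2.example/"],
]
shared = entry_point(chains, DOMAIN_TO_IP)
```

Without grouping, the two chains above would have no URL in common; after grouping, both pass through the same address, which exposes the shared redirection server.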
19. Classifier
All features are normalized between zero and one
Logistic regression was found experimentally to be the best classifier
Ground truth for supervised learning comes from Twitter account status
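To make the normalization and scoring concrete, here is a small sketch in plain Python. It assumes min-max scaling (the slides only say features are normalized between zero and one) and uses two feature columns instead of eleven for brevity. The weights and bias are made up for illustration; in practice they would be learned from labeled data.

```python
import math
from typing import List, Sequence


def min_max_normalize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale every feature column into the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]


def logistic_score(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic regression output: sigmoid of the weighted feature sum."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


raw = [[2, 10], [4, 30], [6, 20]]      # hypothetical raw feature values
normalized = min_max_normalize(raw)    # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
weights, bias = [2.0, -1.0], -0.5      # hypothetical, not learned
scores = [logistic_score(row, weights, bias) for row in normalized]
```

A URL would be flagged as suspicious when its score exceeds a chosen threshold such as 0.5.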
20. Experimentation
Real Twitter data from Twitter Streaming API
Run on the authors' own commodity hardware
Performed experiments on Twitter data to investigate:
Accuracy
Performance
Delay in Detection
21. Accuracy Results
60 days of training data: 183k benign and 42k malicious URLs
30 days of test data: 71k benign and 6.7k malicious URLs
Achieved 3.67% FPR and 3.21% FNR
Of 71k benign, 2.6k marked malicious
Of 6.7k malicious, about 200 not detected
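The two rates follow directly from the counts on this slide. The sketch below recomputes them from the rounded figures, so the results come out close to the reported 3.67% and 3.21% rather than matching them exactly, presumably because the slide rounds the counts.

```python
def error_rate(errors: float, total: float) -> float:
    """Fraction of a class that was misclassified."""
    return errors / total


# False positive rate: benign URLs marked malicious / all benign URLs.
fpr = error_rate(2_600, 71_000)
# False negative rate: malicious URLs missed / all malicious URLs.
fnr = error_rate(200, 6_700)

print(f"FPR ~ {fpr:.2%}, FNR ~ {fnr:.2%}")
```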
22. Performance Results
Running time of each component:
24 ms to crawl redirections (100 concurrent crawls)
2 ms for domain grouping
1.6 ms for feature extraction
0.5 ms for classification
Can process 100,000 URLs in one hour
Redirection crawling can be distributed to improve this
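As a rough check of the throughput claim, the component times can be summed. This assumes the figures are average costs per URL and that the stages run one after another, neither of which the slide states explicitly.

```python
# Per-URL component times in milliseconds, taken from the slide.
component_ms = {
    "crawl_redirections": 24.0,
    "domain_grouping": 2.0,
    "feature_extraction": 1.6,
    "classification": 0.5,
}

total_ms = sum(component_ms.values())          # about 28.1 ms per URL
urls = 100_000
total_minutes = urls * total_ms / 1000 / 60    # about 46.8 minutes

print(f"{total_ms:.1f} ms per URL, {total_minutes:.1f} minutes for {urls} URLs")
```

Under these assumptions 100,000 URLs fit within the stated hour, and crawling accounts for most of the per-URL cost, which is consistent with the remark that distributing the crawl would help most.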
23. Delay Results
WarningBird can detect suspicious accounts faster than Twitter
Results shown only for accounts that Twitter suspended within a day
24. Conclusion
Found an important feature that others have ignored
Attackers must either pay for more redirection servers or risk being caught