4. Spam on Facebook and Twitter

| Platform | # of active users | # of spam accounts | % of spam accounts |
| Facebook | 2.2 billion       | 60-83 million      | 2.73%-3.77%        |
| Twitter  | 330 million       | 23 million         | 6.97%              |

Source: https://www.statista.com/
6. Social Media’s Fundamental Design Flaw
• Sophisticated spam accounts know how to use various features to
cause the greatest harm:
• Use shortened URLs to trick users
• Buy compromised accounts to look legitimate
• Run campaigns to gain traction in a short period of time
• Use bots to amplify the noise
• Social media makes it easier and faster to spread spam.
7. Related Work
• Detection at the tweet level
• Focus on the content of tweets
• E.g., spam words? Overuse of hashtag, URL, mention, …?
• Detection at the account level
• Focus on the characteristics of spam accounts
• E.g., Age of the account? # of followers? # of followees? …
8. Challenges
• Large amount of unlabeled data
• Hand-labeling is time and labor intensive
• Feature selection may cause the model to overfit
• Twitter spam drift
• Spamming behavior changes over time, so the performance of existing
machine-learning-based classifiers decreases.
9. Research Questions
• Question 1: Can we find an unsupervised way to learn from the
unlabeled data and then apply what we have learned to labeled data?
• Will this approach outperform the hand-labeling process?
• Question 2: Can we find a more systematic way to reduce the feature
dimensions instead of feature engineering?
10. Stage 1: Self-taught Learning From Unlabeled Data
Pipeline: Training Data W/O Label → One-to-N Encoding → Max-Min Norm → Sparse Auto-encoder → Trained Parameter Set
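The two preprocessing steps in this pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation; the helper names (`one_to_n_encode`, `max_min_normalize`) and the toy data are assumptions.

```python
import numpy as np

def one_to_n_encode(column, n_categories):
    """One-to-N (one-hot) encode an integer categorical column."""
    encoded = np.zeros((len(column), n_categories))
    encoded[np.arange(len(column)), column] = 1.0
    return encoded

def max_min_normalize(X):
    """Scale each feature column to [0, 1] via max-min normalization."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min + 1e-12)  # epsilon avoids divide-by-zero

# Toy example: one categorical feature (3 categories) plus two numeric features
cat = np.array([0, 2, 1, 2])
num = np.array([[10.0, 1.0], [20.0, 3.0], [15.0, 2.0], [30.0, 5.0]])
X = np.hstack([one_to_n_encode(cat, 3), max_min_normalize(num)])
```

The encoded and normalized columns are concatenated into one feature matrix, which is what the sparse auto-encoder would consume.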
11. Stage 2: Soft-max Classifier Training
Pipeline: Preprocessed Labeled Training Data → Sparse Auto-encoder → Soft-max Regression → Trained Parameter Set
13. Self-taught Learning
• Assumption:
• A single unlabeled record is less informative
• A large amount of unlabeled records may show a certain pattern
• Goal:
• Find an effective model to reveal this pattern (if it exists)
• Choose the sparse auto-encoder for its good performance and simplicity
14. Auto-encoder
• A special neural network whose output is (almost) identical to its input.
• A compression tool
• The hidden layer is considered the compressed representation of the input.
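A minimal numpy sketch of a sparse auto-encoder with a KL-divergence sparsity penalty is given below. The architecture matches the description above (hidden layer as compressed representation); the hyperparameters (`rho`, `beta`, `lr`) and the toy data are assumptions, not the paper's settings.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sparse_autoencoder(X, n_hidden, rho=0.05, beta=0.1, lr=0.5,
                             epochs=1000, seed=0):
    """Minimal sparse auto-encoder trained by batch gradient descent.

    The KL-divergence term pushes the mean hidden activation toward
    the sparsity target rho, weighted by beta.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)            # hidden (compressed) representation
        Xhat = sigmoid(H @ W2 + b2)         # reconstruction of the input
        rho_hat = np.clip(H.mean(axis=0), 1e-6, 1 - 1e-6)
        d_out = (Xhat - X) * Xhat * (1 - Xhat)
        sparse_grad = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
        d_hid = (d_out @ W2.T + sparse_grad) * H * (1 - H)
        W2 -= lr * H.T @ d_out / n; b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * X.T @ d_hid / n; b1 -= lr * d_hid.mean(axis=0)
    return W1, b1, W2, b2

# Toy demo: 6 correlated features compressed into 3 hidden units
rng = np.random.default_rng(1)
Z = rng.random((40, 3))
X = np.hstack([Z, Z])
W1, b1, W2, b2 = train_sparse_autoencoder(X, n_hidden=3)
H = sigmoid(X @ W1 + b1)   # the compressed representation
```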
22. Dataset
• 1065 instances; Each instance has 62 features.
• Split 1065 instances into three groups:
• Training w/o label – 600 instances
• Training w label – 365 instances
• Test w label - 100 instances
• Comparison group: SVM, naïve Bayes, and random forests
• Training w label – 365 instances
• Test w label – 100 instances
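The 600/365/100 split above could be produced as follows; the random partition and seed are assumptions for illustration, not the authors' procedure.

```python
import numpy as np

# Hypothetical split mirroring the slide: 600 unlabeled / 365 labeled train / 100 test
rng = np.random.default_rng(42)   # arbitrary seed
idx = rng.permutation(1065)
unlabeled_idx = idx[:600]
train_idx = idx[600:965]
test_idx = idx[965:]
```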
23. Evaluation
• True Positive (TP): actual spammer, prediction spammer.
• True Negative (TN): actual non-spammer, prediction non-spammer.
• False Positive (FP): actual non-spammer, prediction spammer.
• False Negative (FN): actual spammer, prediction non-spammer.
24. Evaluation
Accuracy: A = (TP + TN) / (TP + TN + FP + FN) * 100%, the correctly
classified instances over the total number of test instances.
Precision: P = TP / (TP + FP) * 100%
Recall: R = TP / (TP + FN) * 100%
F-Measure: F = 2 * R * P / (R + P)
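These four metrics can be computed directly from the confusion counts. A small sketch, exercised here with the SAE counts reported in the results tables (TP=34, TN=52, FP=3, FN=11):

```python
def evaluate(tp, tn, fp, fn):
    """Accuracy, precision, recall and F-measure from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# SAE confusion counts from the results tables
a, p, r, f = evaluate(34, 52, 3, 11)   # a = 0.86, p ≈ 0.919, r ≈ 0.756
```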
26. Results – Comparison with SVM

| Model  | TP | TN | FP | FN | A   | P     | R     | F     |
| SAE    | 34 | 52 | 3  | 11 | 86% | 91.9% | 75.6% | 83.0% |
| Top 5  | 28 | 52 | 2  | 18 | 80% | 93.3% | 60.9% | 73.7% |
| Top 10 | 27 | 52 | 3  | 18 | 79% | 90.0% | 60.0% | 72.0% |
| Top 20 | 28 | 52 | 3  | 17 | 80% | 90.3% | 62.2% | 73.7% |
| Top 30 | 29 | 52 | 3  | 16 | 81% | 90.6% | 64.4% | 75.3% |
27. Results – Comparison with Random Forests & Naïve Bayes

| Model         | TP | TN | FP | FN | A   | P     | R     | F     |
| SAE           | 34 | 52 | 3  | 11 | 86% | 91.9% | 75.6% | 83.0% |
| Random Forest | 32 | 52 | 3  | 13 | 84% | 91.0% | 71.0% | 80.0% |
| Naïve Bayes   | 33 | 50 | 5  | 12 | 83% | 86.8% | 73.0% | 79.5% |
28. Conclusion
• Self-taught learning: a large amount of unlabeled data + a small
amount of labeled data
• Sparse AE: reduces the feature dimensions
• Fine-tuning: improves the deep learning model to a large extent.
29. Limitation & Future Work
• The dataset we use is relatively small.
• We are still exploring new ways to apply this model to raw data.
30. A Deep Learning Approach
For Twitter Spam Detection
Lijie Zhou (lijie@mail.sfsu.edu) and Hao Yue
San Francisco State University
Editor's Notes
The key is to compute the partial derivatives.
We conducted an experiment on this implementation but the result is not as expected.