A Deep Learning Approach
For Twitter Spam Detection
Lijie Zhou (lijie@mail.sfsu.edu) & Hao Yue
San Francisco State University
Outline
• Problem and Challenges
• Past Work
• Our Model and Results
• Conclusion
• Future Work
What Is Spam?
Spam on Facebook and Twitter

| Platform | # of active users | # of spam accounts | % |
|----------|-------------------|--------------------|---|
| Facebook | 2.2 billion | 60-83 million | 2.73%-3.77% |
| Twitter | 330 million | 23 million | 6.97% |

Source: https://www.statista.com/
Various Social Media Sites
Social Media’s Fundamental Design Flaw
• Sophisticated spam accounts know how to exploit various platform features to
cause the greatest harm:
• Use shortened URLs to trick users
• Buy compromised accounts to look legitimate
• Use campaigns to gain traction in a short period of time
• Use bots to amplify the noise
• Social media makes it easier and faster to spread spam.
Related Work
• Detection at the tweet level
• Focus on the content of tweets
• E.g., spam words? Overuse of hashtags, URLs, mentions, …?
• Detection at the account level
• Focus on the characteristics of spam accounts
• E.g., Age of the account? # of followers? # of followees? …
Challenges
• Large amount of unlabeled data
• Time and labor intensive
• Feature selection may cause overfitting
• Twitter spam drift
• Spamming behavior changes over time, so the performance of existing
machine-learning-based classifiers degrades.
Research Questions
• Question 1: Can we find an unsupervised way to learn from the
unlabeled data and later apply what we have learned to labeled data?
• Will this approach outperform the hand-labeling process?
• Question 2: Can we find a more systematic way to reduce the feature
dimensions instead of feature engineering?
Stage 1: Self-taught Learning From Unlabeled Data

Training Data w/o Label → One-to-N Encoding → Max-Min Norm → Sparse Auto-encoder → Trained Parameter Set
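The slides do not spell out the preprocessing steps, so the following is a minimal sketch of what one-to-N (one-hot) encoding and max-min normalization typically look like; the function names, example features, and category vocabulary are all illustrative assumptions, not the paper's code.

```python
import numpy as np

def one_to_n_encode(values, categories):
    """One-to-N (one-hot) encode a categorical feature.
    `categories` is an assumed fixed vocabulary for the feature."""
    idx = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        out[row, idx[v]] = 1.0
    return out

def max_min_normalize(x):
    """Scale each numeric column into [0, 1] via (x - min) / (max - min)."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    rng = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (x - lo) / rng

# Illustrative use on made-up account features (not the paper's dataset):
numeric = np.array([[120., 3.], [4500., 40.], [87., 1.]])  # followers, tweets/day
lang = one_to_n_encode(["en", "es", "en"], ["en", "es", "fr"])
features = np.hstack([max_min_normalize(numeric), lang])
print(features.shape)  # (3, 5)
```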
Stage 2: Soft-max Classifier Training

Preprocessed Labeled Training Data → Sparse Auto-encoder → Soft-max Regression → Trained Parameter Set
Stage 3: Classification

Preprocessed Test Data → Sparse Auto-encoder → Soft-max Regression → Spam/Non-Spam
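Stages 2 and 3 reuse the encoder trained in Stage 1 as a feature map and stack a soft-max classifier on top. The sketch below shows that flow, assuming sigmoid activations and hypothetical trained parameters `W1`, `b1` (encoder) and `theta` (soft-max weights); the class ordering is also an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(X, W1, b1):
    """Feature map: hidden activations of the Stage-1 sparse auto-encoder."""
    return sigmoid(X @ W1.T + b1)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def classify(X_test, W1, b1, theta):
    """Stage 3: encode the preprocessed test data, apply the trained
    soft-max classifier, and return the most likely class per row
    (here column 1 = spam, column 0 = non-spam, by assumption)."""
    return softmax(encode(X_test, W1, b1) @ theta).argmax(axis=1)
```

In Stage 2, `theta` would be fit by minimizing the soft-max (cross-entropy) loss on the encoded labeled training data.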
Self-taught Learning
• Assumption:
• A single unlabeled record is not very informative on its own
• A large amount of unlabeled records may show a certain pattern
• Goal:
• Find an effective model to reveal this pattern (if it exists)
• We choose the sparse auto-encoder for its good performance and simplicity
Auto-encoder
• A special neural network whose
output is (almost) identical to its
input.
• A compression tool
• The hidden layer is considered the
compressed representation of the
input.
Auto-encoder
• Model parameters: $(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$
• Activation of the hidden layer:
$$a_1^{(2)} = f(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)})$$
$$a_2^{(2)} = f(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)})$$
$$a_3^{(2)} = f(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)})$$
• Hypothesis $h_{W,b}(x)$:
$$h_{W,b}(x) = a_1^{(3)} = f(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}) \approx x$$
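In code, this forward pass is two affine maps each followed by the activation f. A minimal sketch, assuming f is the sigmoid and using the slide's 3-input, 3-hidden-unit shapes with random placeholder weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x, W1, b1, W2, b2):
    """a2 = f(W1 x + b1) is the compressed representation;
    h = f(W2 a2 + b2) is the reconstruction, trained so that h ≈ x."""
    a2 = sigmoid(W1 @ x + b1)
    h = sigmoid(W2 @ a2 + b2)
    return a2, h

# Placeholder weights matching the slide's 3-input / 3-hidden-unit shapes:
rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.normal(size=(3, 3)), np.zeros(3)
W2, b2 = 0.1 * rng.normal(size=(3, 3)), np.zeros(3)
a2, h = autoencoder_forward(np.array([0.2, 0.7, 0.1]), W1, b1, W2, b2)
```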
Sparse Auto-encoder
• Sparsity parameter
• Definition: a constraint imposed on the hidden layer
• Goal: ensure the pattern is revealed even if the hidden layer is large
• Average activation: $\hat{\rho}_j = \frac{1}{m}\sum_{i=1}^{m} \left[ a_j^{(2)}(x^{(i)}) \right]$
• Penalty term
• Enforce $\hat{\rho}_j = \rho$ (with $\rho = 0.05$)
• Kullback-Leibler (KL) divergence: $\sum_{j=1}^{K} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{K} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1-\rho) \log \frac{1-\rho}{1-\hat{\rho}_j} \right]$
• $\sum_{j=1}^{K} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = 0$ if $\hat{\rho}_j = \rho$ for all $j$
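A minimal sketch of the sparsity penalty, assuming the hidden activations for all m training records are stacked row-wise in a matrix `A2` (a name introduced here for illustration):

```python
import numpy as np

def kl_sparsity_penalty(A2, rho=0.05):
    """Sum over hidden units of KL(rho || rho_hat_j), where rho_hat_j is
    the average activation of unit j over the m training records.
    A2 is an (m x K) matrix of hidden activations."""
    rho_hat = A2.mean(axis=0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
```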
Cost Function
$$J(W,b) = \underbrace{\frac{1}{m}\sum_{i=1}^{m} \|x_i - \hat{x}_i\|^2}_{\text{average sum-of-squares error}} + \underbrace{\frac{\lambda}{2}\Big(\sum_{k,n} W^2 + \sum_{n,k} V^2 + \sum_{k} b_1^2 + \sum_{n} b_2^2\Big)}_{\text{weight decay}} + \underbrace{\beta \sum_{j=1}^{k} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)}_{\text{sparsity penalty}}$$
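Putting the three terms together, a sketch of J(W, b) follows. It mirrors the slide's formula (including the bias terms inside the weight decay), with the decoder weights `V` renamed to `W2`; the λ and β defaults are placeholders, since the paper's settings are not given on the slide.

```python
import numpy as np

def sae_cost(X, X_hat, W1, W2, b1, b2, rho_hat, lam=1e-4, beta=3.0, rho=0.05):
    """J(W, b): average sum-of-squares reconstruction error + weight decay
    (the slide's version includes the bias terms) + sparsity penalty."""
    m = X.shape[0]
    recon = np.sum((X - X_hat) ** 2) / m
    decay = (lam / 2) * sum(np.sum(p ** 2) for p in (W1, W2, b1, b2))
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return recon + decay + beta * kl
```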
Cost Function
• Goal: minimize J(W, b) as a function of W and b
• Steps
• Initialization
• Update parameters with gradient descent
$$W_{ij}^{(l)} := W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b), \qquad b_i^{(l)} := b_i^{(l)} - \alpha \frac{\partial}{\partial b_i^{(l)}} J(W,b)$$
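In code, one update step over all parameters is simply (a minimal sketch):

```python
def gradient_step(params, grads, alpha):
    """theta <- theta - alpha * dJ/dtheta for every W and b in the network."""
    return [p - alpha * g for p, g in zip(params, grads)]
```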
Back-propagation
$\delta_i^{(n_l)}$ is the "error term": how much node $i$ is "responsible" for any error in the output.
Back-propagation
1. Perform a feedforward pass, computing the activations for layers $L_2, L_3, \dots$ up to the output layer $L_{n_l}$.
2. For each output unit $i$ in layer $n_l$ (the output layer), set
$$\delta_i^{(n_l)} = -(y_i - a_i^{(n_l)}) \cdot f'(z_i^{(n_l)})$$
3. For $l = n_l - 1, n_l - 2, n_l - 3, \dots, 2$: for each node $i$ in layer $l$, set
$$\delta_i^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_j^{(l+1)} \right) f'(z_i^{(l)})$$
4. Compute the partial derivatives:
$$\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x,y) = a_j^{(l)} \delta_i^{(l+1)}, \qquad \frac{\partial}{\partial b_i^{(l)}} J(W,b;x,y) = \delta_i^{(l+1)}$$
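The four steps translate directly to code. The sketch below assumes f is the sigmoid (so f'(z) = a(1 - a)) and covers only the squared-error part of the gradient for a single example; the weight-decay and sparsity gradients the full model needs are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    """Gradients of the squared-error term for one example
    (y = x for an auto-encoder). Steps 1-4 from the slide."""
    # Step 1: feedforward pass
    z2 = W1 @ x + b1
    a2 = sigmoid(z2)
    z3 = W2 @ a2 + b2
    a3 = sigmoid(z3)
    # Step 2: output-layer error term; f'(z) = a * (1 - a) for the sigmoid
    d3 = -(y - a3) * a3 * (1 - a3)
    # Step 3: propagate the error back to the hidden layer
    d2 = (W2.T @ d3) * a2 * (1 - a2)
    # Step 4: partial derivatives
    dW2, db2 = np.outer(d3, a2), d3
    dW1, db1 = np.outer(d2, x), d2
    return dW1, db1, dW2, db2
```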
Fine-tuning

Preprocessed Labeled Training Data → Sparse Auto-encoder → Soft-max Regression → Trained Parameter Set

• Fine-tuning: the error on the labeled data is back-propagated through the soft-max layer into the sparse auto-encoder, so both parameter sets are updated jointly.
Dataset
• 1,065 instances; each instance has 62 features.
• Split 1065 instances into three groups:
• Training w/o label – 600 instances
• Training w label – 365 instances
• Test w label - 100 instances
• Comparison group: SVM, Naïve Bayes, and random forests
• Training w label – 365 instances
• Test w label – 100 instances
Evaluation
• True Positive (TP): actual spammer, prediction spammer.
• True Negative (TN): actual non-spammer, prediction non-spammer.
• False Positive (FP): actual non-spammer, prediction spammer.
• False Negative (FN): actual spammer, prediction non-spammer.
Evaluation
Accuracy: $A = \frac{TP + TN}{TP + TN + FP + FN}$, the correctly classified instances over the total number of test instances.

Precision: $P = \frac{TP}{TP + FP} \times 100\%$

Recall: $R = \frac{TP}{TP + FN} \times 100\%$

F-Measure: $F = \frac{2 \cdot P \cdot R}{P + R}$
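These four metrics follow directly from the confusion counts; the minimal sketch below reproduces the SAE row of the tables that follow.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-measure from confusion counts."""
    a = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return a, p, r, f

print(metrics(34, 52, 3, 11))  # SAE row: ≈ (0.86, 0.919, 0.756, 0.830)
```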
Results

Classification accuracy for different hidden-layer sizes (rows: Hidden L1; columns: Hidden L2):

| Hidden L1 \ L2 | 15 | 20 | 25 | 30 | 35 | 40 | 45 | 50 | 55 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| 55 | 86% | 88% | 85% | 84% | 87% | 85% | 83% | 86% | 86% | 86% |
| 50 | 84% | 84% | 86% | 88% | 86% | 89% | 87% | 86% | 88% | 86% |
| 45 | 85% | 88% | 87% | 86% | 85% | 84% | 88% | 86% | 86% | 86% |
| 40 | 88% | 87% | 85% | 85% | 85% | 87% | 87% | 86% | 89% | 87% |
| 35 | 87% | 88% | 87% | 86% | 87% | 86% | 86% | 85% | 86% | 86% |
| 30 | 85% | 86% | 89% | 85% | 85% | 84% | 83% | 87% | 88% | 86% |
| 25 | 87% | 87% | 88% | 87% | 85% | 88% | 85% | 87% | 88% | 87% |
| 20 | 84% | 88% | 83% | 88% | 86% | 85% | 88% | 87% | 86% | 86% |
| 15 | 83% | 83% | 83% | 87% | 85% | 82% | 85% | 86% | 85% | 84% |
| Avg | 85% | 87% | 86% | 86% | 86% | 86% | 86% | 86% | 87% | |
Results – Comparison with SVM

| Model | TP | TN | FP | FN | A | P | R | F |
|---|---|---|---|---|---|---|---|---|
| SAE | 34 | 52 | 3 | 11 | 86% | 91.9% | 75.6% | 83.0% |
| Top 5 | 28 | 52 | 2 | 18 | 80% | 93.3% | 60.9% | 73.7% |
| Top 10 | 27 | 52 | 3 | 18 | 79% | 90.0% | 60.0% | 72.0% |
| Top 20 | 28 | 52 | 3 | 17 | 80% | 90.3% | 62.2% | 73.7% |
| Top 30 | 29 | 52 | 3 | 16 | 81% | 90.6% | 64.4% | 75.3% |
Results – Comparison with Random Forests & Naïve Bayes

| Model | TP | TN | FP | FN | A | P | R | F |
|---|---|---|---|---|---|---|---|---|
| SAE | 34 | 52 | 3 | 11 | 86% | 91.9% | 75.6% | 83.0% |
| Random Forest | 32 | 52 | 3 | 13 | 84% | 91% | 71.0% | 80.0% |
| Naïve Bayes | 33 | 50 | 5 | 12 | 83% | 86.8% | 73.0% | 79.5% |
Conclusion
• Self-taught learning: a large amount of unlabeled data + a small amount
of labeled data
• Sparse AE: reduces the feature dimensions
• Fine-tuning: improves the deep learning model to a large extent
Limitation & Future Work
• The dataset we use is relatively small.
• We are still exploring new ways to apply this model to raw data.
A Deep Learning Approach
For Twitter Spam Detection
Lijie Zhou (lijie@mail.sfsu.edu) and Hao Yue
San Francisco State University
Editor's Notes
1. The key is to compute the partial derivatives.
2. We conducted an experiment on this implementation, but the result was not as expected.