1. Introduction Tag-spam detection Youtube Video Spamming Conclusions References
Adversarial IR - Social Spamming
Nicola Miotto
Unipd - Computer Science
January 22, 2011
Nicola Miotto (Unipd - Computer Science) Adversarial IR - Social Spamming January 22, 2011 1 / 39
Outline
1 Introduction
Spam
Adversarial IR
2 Tag-spam detection in Social Bookmarking systems
Problem description
Features
Classification
3 Youtube Video Spamming
Problem description
Features
Classification
4 Conclusions
5 References
Introduction
Spam - History
1970: the BBC broadcasts the Spam sketch by Monty Python’s Flying
Circus, from which the current meaning of the term is derived;
1978: an advertising message is sent to 393 ARPANET users, the earliest
documented spam;
1990s: "Make Money Fast" messages flood many newsgroups. First
association of the term spam with an IT-related field;
1998: new definition for the term spam in the New Oxford Dictionary
of English:
Definition
Irrelevant or inappropriate messages sent on the Internet to a large number
of newsgroups or users.
Spam - Fields
E-mail
Instant Messaging: Messaging spam
Web-Search: Spamdexing
Social systems: Social spam
And so on...
Spam - Spammer
Earn money on the web!
Services such as Google AdSense or Heyos let users place automatically
generated ads on their web pages and earn money from clicks and page
impressions.
Legitimate advertiser:
Builds websites that host content-related ads;
Improves the PageRank of the website for the relevant keywords;
Tries to lead potential customers to his websites;
Spam - Spammer
Spammer:
Website content is used only to attract users and improve the PageRank;
No distinction between interested and uninterested users;
Automatic spam-network generation programs:
they find the relevant keywords (e.g. via AdWords);
they register domain names containing those keywords;
they create complete websites with fake content built around the
keywords found;
they link the generated websites together in order to improve the
PageRank;
Spam - Social Spamming
Spam campaigns directed at social network users
Social bookmarking systems: Delicious;
Video social networks: YouTube;
General-purpose social networks: Facebook;
and so on...
Spam - Social Spamming
Features:
Lots of user-related information;
Easier to target a specific demographic segment;
Cheaper (usually);
Adopted solution (most of the time): Report abuse
→ generic solution, but less effective than ad-hoc ones.
Spam - Consequences
Users are diverted towards content outside their information needs;
Unfair competition with legitimate advertisers;
Information poisoning due to spam noise.
Adversarial IR - Definition
Adversarial: “Assumes competing parties trying to affect the outcome of
a system (system could be an algorithm, a market, etc)”
Adversarial IR: “Information retrieval, ranking, or classification system
affected by multiple parties acting in their own interest”
Adversarial IR - AIRWeb
AIRWeb
Adversarial Information Retrieval on the Web
Annual workshop about Adversarial IR
Researchers and industry practitioners gather to present and
discuss advances in the state of the art of Adversarial IR
First workshop in 2005 (Japan)
Discussed techniques
AIRWeb papers 42
Social spam recognition techniques discussed during the
AIRWeb workshops
Supervised Machine Learning 42
1 Feature modelling
2 Training dataset retrieval
3 Machine learning (e.g. SVM)
4 Result evaluation
Tag-spam detection in Social Bookmarking systems
Problem description - Tag-spam
Social bookmarking system:
Users can associate meta-information (tags) with resources (links);
One or more words can be associated with any resource;
Advertiser:
Social tagging: posts links to his website and tags them with
content-related keywords
Spammer:
Uses the most popular keywords (e.g. music) to tag unrelated websites
(e.g. his spam websites);
Figure: Delicious.com Screenshot (2011)
Figure: Example: Tag-spam on Delicious.com (2008)
Problem description - Folksonomy
Data structure to represent a social tagging system;
Hyper-graph connecting users, resources and tags;
Symbols:
$u \in U$, $U$ set of users;
$r \in R$, $R$ set of resources;
$t \in T$, $T$ set of tags;
$post = \{(u, r, t_1), \ldots, (u, r, t_n)\} = \{(u, r, (t_1, \ldots, t_n))\}$
$F = post_1 \cup \ldots \cup post_m$, the set of all $(u, r, t)$ triples
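The folksonomy above can be sketched as a plain set of (user, resource, tag) triples. This is only an illustrative sketch: the helper names `add_post` and `posts`, and all users, URLs and tags, are made up for the example and do not come from the slides.

```python
from collections import defaultdict

# A folksonomy F stored as a set of (user, resource, tag) triples.
# All users, resources and tags below are invented illustration data.
F = set()


def add_post(folksonomy, user, resource, tags):
    """Add one post: a user annotating a resource with one or more tags."""
    for tag in tags:
        folksonomy.add((user, resource, tag))


def posts(folksonomy):
    """Group the triples back into posts: (user, resource) -> set of tags."""
    grouped = defaultdict(set)
    for user, resource, tag in folksonomy:
        grouped[(user, resource)].add(tag)
    return dict(grouped)


add_post(F, "alice", "http://example.org/jazz", ["music", "jazz"])
add_post(F, "bob", "http://example.org/shop", ["music", "free", "news"])
```

Storing triples keeps membership tests such as "(u, r, t) in F" cheap, while `posts(F)` recovers the per-post view when it is needed.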
Figure: Folksonomy graphical representation example
Features - Tag based
Which tags do spammers use?
TagSpam
$U_t = \{u : \exists r : (u, r, t) \in F\}$, users who used tag $t$
$S_t \subseteq U_t$, users in $U_t$ identified as spammers
$Pr(t) = \frac{|S_t|}{|U_t|}$
$T(u, r) = \{t : (u, r, t) \in F\}$
$f_{TagSpam}(u, r) = \frac{1}{|T(u, r)|} \sum_{t \in T(u, r)} Pr(t)$
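A minimal sketch of the TagSpam feature, assuming the folksonomy is available as a set of (user, resource, tag) triples and that a set of already labelled spammers exists. The function name `tag_spam` and the toy data are hypothetical.

```python
def tag_spam(folksonomy, spammers, user, resource):
    """f_TagSpam(u, r): mean over the post's tags of Pr(t) = |S_t| / |U_t|."""
    tags = {t for (u, r, t) in folksonomy if u == user and r == resource}
    if not tags:
        return 0.0  # assumption: a post without tags carries no evidence
    total = 0.0
    for tag in tags:
        users_t = {u for (u, _, t) in folksonomy if t == tag}  # U_t
        spammers_t = users_t & spammers                        # S_t
        total += len(spammers_t) / len(users_t)
    return total / len(tags)


# Toy folksonomy; "bob" is the only user labelled as a spammer.
F = {
    ("alice", "r1", "jazz"),
    ("bob", "r2", "music"),
    ("bob", "r2", "free"),
    ("carol", "r3", "music"),
}
known_spammers = {"bob"}
score = tag_spam(F, known_spammers, "bob", "r2")
```

In the toy data the tag "music" is used by one spammer out of two users and "free" by one spammer out of one user, so the post by "bob" scores the mean of 0.5 and 1.0.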
Features - Tag based
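Is there a semantic relationship between the tags?
TagBlur
$\sigma(t_1, t_2) \in [0, 1]$, normalized similarity between tags $t_1$ and $t_2$
$Z$ = number of tag pairs in $T(u, r)$
$\epsilon$ = small constant that keeps the distance finite when $\sigma(t_1, t_2) = 0$
$f_{TagBlur}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u, r)} \frac{1}{\sigma(t_1, t_2) + \epsilon} - \frac{1}{1 + \epsilon}$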
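A rough sketch of the TagBlur feature for a single post. The similarity function is passed in as a parameter because the slides do not fix one; `toy_sigma`, its lookup table, and the default value of the small constant `eps` are assumptions made for the example. Unordered pairs are used, which gives the same average as ordered pairs when the similarity is symmetric.

```python
from itertools import combinations


def tag_blur(tags, sigma, eps=0.01):
    """Average inverse similarity over all tag pairs, shifted so that a post
    whose tags are all maximally similar (sigma = 1) scores 0."""
    pairs = list(combinations(sorted(set(tags)), 2))
    if not pairs:
        return 0.0  # assumption: a single-tag post cannot be blurred
    total = sum(1.0 / (sigma(t1, t2) + eps) for t1, t2 in pairs)
    return total / len(pairs) - 1.0 / (1.0 + eps)


# Hypothetical similarity table: only "jazz" and "music" are related.
TOY_SIMILARITY = {frozenset(("jazz", "music")): 0.9}


def toy_sigma(t1, t2):
    return TOY_SIMILARITY.get(frozenset((t1, t2)), 0.0)


focused = tag_blur(["music", "jazz"], toy_sigma)
blurred = tag_blur(["music", "free", "news"], toy_sigma)
```

With this toy table the post tagged with unrelated words gets a much larger score than the post with two related tags, which is the behaviour the feature is meant to capture.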
Features - Resource based I
DomFP: spammers use programs to generate pages → spam pages share
the same content
We know the fingerprints of some spam pages
Estimate the likelihood that r is spam by comparing the fingerprint
of r with the known ones
NumAds: spammers' pages usually just host lots of ads
NumAds application example: count the occurrences of
googlesyndication.com in the HTML code of the resource
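The NumAds example can be sketched as a simple substring count over the page source. The function name `num_ads` and the sample page are illustrative only; fetching the HTML is left out.

```python
def num_ads(html, marker="googlesyndication.com"):
    """Count case-insensitive occurrences of the ad-server marker in the HTML."""
    return html.lower().count(marker.lower())


# Invented page source with two ad script tags.
page = """
<html><body>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script>
<p>Some text</p>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script>
</body></html>
"""
ads = num_ads(page)
```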
Features - Resource based
Plagiarism: spammers usually copy content from high-ranked
websites
Compare the content of r with other web pages
ValidLinks: spammer websites are frequently taken down
Many invalid links posted by u imply a greater
likelihood that u is a spammer
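One possible reading of ValidLinks is the fraction of a user's posted links that still resolve. The sketch below takes the reachability check as a parameter, so no network access is needed; `valid_links`, the `is_valid` callback and the URLs are hypothetical names chosen for the example.

```python
def valid_links(urls, is_valid):
    """Fraction of a user's posted URLs for which is_valid(url) is true."""
    urls = list(urls)
    if not urls:
        return 1.0  # assumption: a user without posts gives no evidence of spam
    return sum(1 for url in urls if is_valid(url)) / len(urls)


# Stand-in for a real HTTP check: a fixed set of URLs that still resolve.
alive = {"http://example.org/a", "http://example.org/b"}
posted = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/gone",
    "http://example.org/removed",
]
ratio = valid_links(posted, lambda url: url in alive)
```

A low ratio means many of the user's links are dead, which the slide associates with a higher likelihood of the user being a spammer.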
Classification - Training dataset
BibSonomy.org :
public dataset
27,000 users and their posts
manual classification → 25,000 spammers and 2,000
legitimate users
Classification:
Binary classification of each user as either spammer or
non-spammer
YouTube Video Spamming
Description - YouTube video spam
Video response: a user replies to a video with another, related video
Spammer: a user who replies with unrelated videos
Reasons:
increasing video popularity
marketing campaigns
pornography distribution
system poisoning
Issue: automatic content-based spam recognition is hard to implement
Description - Techniques
Content-based recognition:
video content analysis
requires too many computational resources
hard to generalize the notion of spam in a video, unless it has
textual content
Video and user relationship analysis:
lots of information publicly available
spammers have specific social features (they're lonely)
user behaviour towards spammers can be analysed automatically
Features - User-based
For each user:
# posted videos
# friends
# watched videos
# favourite videos
# video responses
# responded videos
# subscriptions
# subscribers
Features - Video-based
2 categories per user:
All posted videos
Video responses only
7 attributes for each category:
# views
duration
# votes
# comments
# favourites
# YouTube honours
# external links
Total and average of each attribute for each category, so 28 features in total.
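The count of 28 follows from 2 categories × 7 attributes × 2 aggregates. A small sketch of how such a feature vector could be assembled is shown below; the attribute keys, the `is_response` flag and the helper names are assumptions for illustration, not the layout used in the original study.

```python
ATTRIBUTES = ["views", "duration", "votes", "comments",
              "favourites", "honours", "external_links"]


def make_video(is_response, **values):
    """Build a video record with every attribute defaulting to 0."""
    video = {attr: 0 for attr in ATTRIBUTES}
    video.update(values)
    video["is_response"] = is_response
    return video


def video_features(videos):
    """Total and average of each attribute, over all videos and over responses."""
    categories = {
        "all": videos,
        "responses": [v for v in videos if v["is_response"]],
    }
    features = {}
    for category, items in categories.items():
        for attr in ATTRIBUTES:
            total = sum(v[attr] for v in items)
            features[f"{category}_{attr}_total"] = total
            features[f"{category}_{attr}_avg"] = total / len(items) if items else 0.0
    return features


# Invented data: one regular upload and one video response.
videos = [
    make_video(False, views=100, duration=60, comments=4),
    make_video(True, views=20, duration=30, external_links=2),
]
features = video_features(videos)
```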
Features - Social network
Based on the video-response user graph:
directed graph (X, Y)
each user is a node in the graph
$(x_1, x_2)$ is a directed edge from $x_1 \in X$ to $x_2 \in Y$ if $x_1$ responded to
a video of $x_2$
Analysis:
in/out degree of each user
assortativity: degree(n) / avg( degree(neighbours(n)) )
userrank: depends on the quantity and quality of incoming links
clustering coefficient, betweenness, reciprocity
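Some of the graph measures above can be sketched with plain dictionaries. The sketch covers in/out degree, the assortativity ratio as written on the slide, and reciprocity, here assumed to be the share of users a node responded to who also responded back. Userrank, clustering coefficient and betweenness are left out. The function name and the edge list are invented for the example.

```python
from collections import defaultdict


def graph_features(edges):
    """edges: iterable of (responder, responded_to) user pairs."""
    out_nb = defaultdict(set)  # users each node responded to
    in_nb = defaultdict(set)   # users who responded to each node
    nodes = set()
    for src, dst in set(edges):
        out_nb[src].add(dst)
        in_nb[dst].add(src)
        nodes.update((src, dst))

    def degree(n):
        return len(in_nb[n]) + len(out_nb[n])

    features = {}
    for n in nodes:
        neighbours = in_nb[n] | out_nb[n]
        avg_nb_degree = sum(degree(m) for m in neighbours) / len(neighbours)
        mutual = out_nb[n] & in_nb[n]
        features[n] = {
            "in_degree": len(in_nb[n]),
            "out_degree": len(out_nb[n]),
            "assortativity": degree(n) / avg_nb_degree,
            "reciprocity": len(mutual) / len(out_nb[n]) if out_nb[n] else 0.0,
        }
    return features


# Invented graph: "spammer" responds to everyone, nobody responds back.
edges = [("spammer", "a"), ("spammer", "b"), ("spammer", "c"),
         ("a", "b"), ("b", "a")]
features = graph_features(edges)
```

In the toy graph the "spammer" node has a high out-degree, no incoming edges and zero reciprocity, which matches the intuition on the earlier slide that spammers are socially isolated.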
Classification - Dataset
Data crawling:
Starting from the top 100 most-responded videos, retrieve the connected data
about video responses, responded videos and users.
Manual classification:
Each user with at least one video response unrelated to the responded video
is classified as a spammer.
Test set:
473 legitimate users + 119 spammers = 592 users
Classification - Training
Support Vector Machine
5-fold cross-validation
Adopted features:
user-based
video-based
social-network
all together
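The 5-fold cross-validation loop can be sketched without any machine-learning library. The SVM itself is not shown: `train_and_score` is a hypothetical callback that would train a classifier on the training fold and return its score on the held-out fold. Here it is replaced by a majority-class baseline purely so that the example runs.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k shuffled, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(samples, labels, train_and_score, k=5):
    """Average the score returned by train_and_score over the k folds."""
    scores = []
    for train, test in k_fold_indices(len(samples), k):
        scores.append(train_and_score(
            [samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test]))
    return sum(scores) / len(scores)


def majority_baseline(train_x, train_y, test_x, test_y):
    """Stand-in for a real classifier: always predict the most common label."""
    majority = max(set(train_y), key=train_y.count)
    return sum(1 for y in test_y if y == majority) / len(test_y)


# Same class sizes as the test set above; the features are placeholders.
labels = ["legitimate"] * 473 + ["spammer"] * 119
samples = list(range(len(labels)))
baseline_accuracy = cross_validate(samples, labels, majority_baseline)
```

With 473 legitimate users out of 592, a classifier that never flags anyone already reaches an accuracy of roughly 0.8, so accuracy alone says little about how many spammers are actually caught.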
Classification - Results
Measure    User    Video   SN      All
TP         0.054   0.426   0.375   0.439
TN         0.998   0.922   1       0.981
FP         0.002   0.078   0       0.019
FN         0.946   0.574   0.625   0.561
Accuracy   0.821   0.821   0.874   0.870
F          0.094   0.484   0.540   0.558
TP = users correctly classified as spammers
FP = legitimate users classified as spammers
Conclusions
Conclusions
Pros:
Few legitimate users classified as spammers
Tag-spam recognition finds most of the spammers
Datasets built from publicly available information
Cons:
Social systems are already poisoned by spam
Manual classification of training examples
References I
Brian D. Davison,
The Potential for Research and Development in Adversarial
Information Retrieval,
Computer Science and Engr., Lehigh University, Cambridge, 2009,
available at http://airweb.cse.lehigh.edu/2009/slides/Davison-AIRWeb2009-Keynote.pdf.
B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme,
Evaluating similarity measures for emergent semantics of social
tagging,
In Proc. 18th Intl. WWW Conf., 2009,
available at http://www2009.org/proceedings/pdf/p641.pdf.
Benjamin Markines, Ciro Cattuto, Filippo Menczer,
Social Spam Detection, AIRWeb ’09, April 21, 2009 Madrid, Spain,
available at
http://airweb.cse.lehigh.edu/2009/papers/p41-markines.pdf.
References II
Fabricio Benevenuto, Tiago Rodrigues, Virgilio Almeida, Jussara Almeida,
Chao Zhang, Keith Ross,
Identifying Video Spammers in Online Social Networks,
AIRWeb ’08, April 22, 2008 Beijing, China,
available at http://airweb.cse.lehigh.edu/2008/submissions/benevenuto_2008_spam_video.pdf.