Adversarial IR - Social spam recognition


Adversarial IR - Social Spamming

Nicola Miotto

Unipd - Computer Science

January 22, 2011



Outline

1 Introduction
    Spam
    Adversarial IR
2 Tag-spam detection in Social Bookmarking systems
    Problem description
    Features
    Classification
3 YouTube Video Spamming
    Problem description
    Features
    Classification
4 Conclusions
5 References



Introduction



Spam - History

1970: the BBC broadcasts the Spam sketch in Monty Python’s Flying
Circus, from which the current meaning of the term derives;
1978: an advertising message is sent to 393 ARPANET users, the earliest
documented spam;
1990s: Make Money Fast messages flood many newsgroups; first
association of the term spam with an IT-related field;
1998: a new definition of the term spam enters the New Oxford Dictionary
of English:
Definition
Irrelevant or inappropriate messages sent on the Internet to a large number
of newsgroups or users.



Spam - Fields

E-mail

Instant Messaging: Messaging spam

Web-Search: Spamdexing

Social systems: Social spam

And so on...



Spam - Spammer

Earn money on the web!
Services such as Google AdSense or Heyos let users place automatically
generated ads on their web pages and earn money from clicks and page
impressions.

Legitimate advertiser:
Builds a website on which to place content-related ads;

Improves the website’s PageRank for the relevant keywords;

Tries to lead potential customers to the website;



Spam - Spammer

Spammer:
Website content is used only to attract users and improve the PageRank;

No distinction between interested and uninterested users;
Automatic spam-network generation programs:
they find the relevant keywords (e.g. via AdWords);
they register domain names containing those keywords;
they create complete websites with fake content built around the keywords
found;
they link the generated websites together in order to improve the
PageRank;



Spam - Social Spamming

Spam campaigns directed at social network users
Social bookmarking systems: Delicious;
Video social network: YouTube;
General purpose social network: Facebook;
and so on...



Spam - Social Spamming

Features:
Lots of user-related information;
Easier to target a specific demographic segment;
Cheaper (usually);
Adopted solution (most of the time): Report abuse
→ a generic solution, but less effective than ad-hoc ones.



Spam - Consequences

Users are diverted to content outside their information needs;

Unfair competition with legitimate advertisers;

Information poisoning due to spam noise.



Adversarial IR - Definition

Adversarial: “Assumes competing parties trying to affect the outcome of
a system (system could be an algorithm, a market, etc)”

Adversarial IR: “Information retrieval, ranking, or classification system
affected by multiple parties acting in their own interest”



Adversarial IR - AIRWeb

AIRWeb
Adversarial Information Retrieval on the Web

Annual workshop about Adversarial IR

Researchers and industry practitioners gather to present and
discuss advances in the state of the art of Adversarial IR

First workshop in 2005 (Japan)



Discussed techniques

AIRWeb papers 42
Social spam recognition techniques discussed during the
AIRWeb workshops

Supervised Machine Learning 42
1 Feature modelling

2 Training dataset retrieval

3 Machine learning (e.g. SVM)

4 Result evaluation



Tag-spam detection in Social Bookmarking systems



Problem description - Tag-spam

Social bookmarking system:
Users can associate meta-information (tags) with resources (links);
One or more words can be associated with any resource;

Advertiser:
Social tagging: posts links to his website, tagging them with
content-related keywords

Spammer:
Uses the most popular keywords (e.g. music) to tag unrelated websites
(e.g. his spam websites);



Figure: Delicious.com Screenshot (2011)



Figure: Example: Tag-spam on Delicious.com (2008)



Problem description - Folksonomy

Data structure to represent a social tagging system;

Hyper-graph connecting users, resources and tags;

Symbols:
u ∈ U, U set of users;
r ∈ R, R set of resources;
t ∈ T, T set of tags;

post = {(u, r, t1), ..., (u, r, tn)} = {(u, r, (t1, ..., tn))}

F = {post1, ..., postn}
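A minimal sketch of how such a folksonomy could be held in memory, assuming it is stored as a flat set of (user, resource, tag) triples. The users, URLs and tags are made up for illustration, and `group_posts` is a hypothetical helper, not something from the slides.

```python
from collections import defaultdict

# Hypothetical folksonomy: a set of (user, resource, tag) triples.
F = {
    ("alice", "http://example.org/jazz", "music"),
    ("alice", "http://example.org/jazz", "jazz"),
    ("bob", "http://example.org/pills", "music"),
    ("bob", "http://example.org/pills", "free"),
}


def group_posts(folksonomy):
    """Group triples into posts, keyed by (user, resource)."""
    posts = defaultdict(set)
    for user, resource, tag in folksonomy:
        posts[(user, resource)].add(tag)
    return dict(posts)


# The three node sets of the hypergraph.
U = {u for u, _, _ in F}
R = {r for _, r, _ in F}
T = {t for _, _, t in F}

posts = group_posts(F)
```

Keeping the triples flat makes membership tests such as `(u, r, t) in F` direct, while `group_posts` recovers the per-post view when needed.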



Figure: Folksonomy graphical representation example



Features - Tag based

Which tags do spammers use?

TagSpam
U_t = {u : ∃r : (u, r, t) ∈ F}, the users who used tag t
S_t ⊆ U_t, the users of t identified as spammers
$\Pr(t) = \frac{|S_t|}{|U_t|}$

T(u, r) = {t : (u, r, t) ∈ F}

$f_{TagSpam}(u, r) = \frac{1}{|T(u, r)|} \sum_{t \in T(u, r)} \Pr(t)$
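The TagSpam feature can be sketched as follows, reading it as the average of Pr(t) over the tags of a post, with Pr(t) the share of a tag's users who are known spammers. The toy triples, the set of known spammers and the function names are assumptions made for illustration; returning 0.0 for a post without tags is also an assumption.

```python
def tag_spam_probability(folksonomy, spammers):
    """Pr(t) = |S_t| / |U_t| for every tag t in the folksonomy."""
    users_by_tag = {}
    for user, _resource, tag in folksonomy:
        users_by_tag.setdefault(tag, set()).add(user)
    return {tag: len(users & spammers) / len(users)
            for tag, users in users_by_tag.items()}


def f_tag_spam(folksonomy, pr, user, resource):
    """Average Pr(t) over T(u, r), the tags of the post (user, resource)."""
    tags = {t for (u, r, t) in folksonomy if u == user and r == resource}
    if not tags:
        return 0.0  # assumption: a post without tags carries no evidence
    return sum(pr[t] for t in tags) / len(tags)


# Toy data, made up for illustration.
triples = {
    ("alice", "http://example.org/jazz", "music"),
    ("alice", "http://example.org/jazz", "jazz"),
    ("bob", "http://example.org/pills", "music"),
    ("bob", "http://example.org/pills", "free"),
}
known_spammers = {"bob"}

pr = tag_spam_probability(triples, known_spammers)
alice_score = f_tag_spam(triples, pr, "alice", "http://example.org/jazz")
bob_score = f_tag_spam(triples, pr, "bob", "http://example.org/pills")
```

In this toy data "music" is shared by a spammer and a legitimate user, so it contributes 0.5 to both posts; the other tag of each post separates the two scores.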



Features - Tag based

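Is there a semantic relationship between tags?

TagBlur
σ(t1, t2) ∈ [0, 1], normalized tag similarity between t1 and t2
Z = number of tag pairs in T(u, r)
ε = small positive constant that keeps the terms finite when σ(t1, t2) = 0

$f_{TagBlur}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u, r)} \frac{1}{\sigma(t_1, t_2) + \epsilon} - \frac{1}{1 + \epsilon}$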
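A hedged sketch of TagBlur, assuming the formula reads as the average of 1 / (σ(t1, t2) + ε) over the tag pairs of a post, minus 1 / (1 + ε), where ε is a small constant. The value of `eps`, the toy similarity table and the choice to iterate over unordered pairs are assumptions; with a symmetric σ, ordered and unordered pairs give the same average.

```python
from itertools import combinations


def f_tag_blur(tags, sigma, eps=0.01):
    """Average inverse similarity over tag pairs, shifted so that
    perfectly similar tags (sigma == 1) score 0."""
    pairs = list(combinations(sorted(tags), 2))
    if not pairs:
        return 0.0  # assumption: a single tag cannot be blurred
    total = sum(1.0 / (sigma(t1, t2) + eps) for t1, t2 in pairs)
    return total / len(pairs) - 1.0 / (1.0 + eps)


# Hypothetical similarity table; unknown pairs default to 0.
SIMILARITY = {
    frozenset(("jazz", "music")): 0.9,
    frozenset(("free", "music")): 0.1,
}


def toy_sigma(t1, t2):
    return SIMILARITY.get(frozenset((t1, t2)), 0.0)


focused = f_tag_blur({"jazz", "music"}, toy_sigma)
blurred = f_tag_blur({"free", "music"}, toy_sigma)
```

A post whose tags are semantically unrelated gets a larger value, which is the signal the feature is meant to capture.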



Features - Resource based I

DomFP
Spammers use programs to generate pages → spam pages share the same content
We know the fingerprints of some spam pages
Compute the likelihood that r is spam by comparing its fingerprint to the known ones

NumAds
Spammers usually just offer lots of ads
Example: count the occurrences of googlesyndication.com in the HTML code of the resource
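The NumAds example amounts to a substring count over the page source. A minimal sketch, assuming the HTML has already been downloaded into a string; the sample markup and the function name are illustrative only.

```python
def num_ads(html, marker="googlesyndication.com"):
    """Count occurrences of an ad-network marker in a page's HTML source."""
    return html.lower().count(marker.lower())


# Sample markup, made up for illustration.
sample_html = """
<html><body>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script>
<p>Some text</p>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script>
</body></html>
"""

ads = num_ads(sample_html)
```

The raw count would normally be fed to the classifier as one feature among the others rather than thresholded on its own.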



Features - Resource based II

Plagiarism
Spammers usually copy content from high-ranked websites
Compare the content of r to other web pages

ValidLinks
Spammer websites are frequently taken down
Many invalid links posted by u imply a greater likelihood that u is a spammer
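One possible reading of ValidLinks is the fraction of a user's posted links that still resolve. The sketch below assumes that reading and takes the reachability check as a parameter so that it runs without network access; `toy_is_valid` and the URLs are hypothetical.

```python
def valid_links_fraction(links, is_valid):
    """Fraction of a user's posted links for which is_valid(link) is true."""
    links = list(links)
    if not links:
        return 1.0  # assumption: no links means nothing is invalid
    return sum(1 for link in links if is_valid(link)) / len(links)


# Hypothetical stand-in for a real reachability check (e.g. an HTTP request).
DEAD_LINKS = {"http://example.org/pills"}


def toy_is_valid(url):
    return url not in DEAD_LINKS


fraction = valid_links_fraction(
    ["http://example.org/jazz", "http://example.org/pills"], toy_is_valid
)
```

A low fraction for a user would then count as evidence towards the spammer class.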



Classification - Training dataset

BibSonomy.org:
public dataset

27,000 users and their posts

manual classification → 25,000 spammers and 2,000
legitimate users

Classification:
Binary classification into spammer or non-spammer



Classification - Results

               SVM                      AdaBoost
Features       Accuracy  FP    F1       Accuracy  FP    F1
TagSpam        95.82%    .061  .957     94.66%    .048  .943
+ TagBlur      96.75%    .048  .966     96.06%    .044  .958
+ DomFP        96.75%    .048  .966     96.06%    .044  .958
+ ValidLinks   96.52%    .048  .964     96.75%    .026  .965
+ NumAds       96.52%    .048  .964     97.22%    .026  .970
+ Plagiarism   96.75%    .048  .966     98.38%    0.22  .983



YouTube Video Spamming



Description - YouTube video spam

Video response: a user replies to a video with another, related video

Spammer: a user who replies with unrelated videos
Reasons:
increase video popularity
marketing campaigns
pornography distribution
system poisoning

Issue: automatic content-based spam recognition is hard to implement



Description - Techniques

Content-based recognition:
video content analysis
requires too many computational resources
hard to generalize the notion of spam in a video, especially when it has no
textual content

Video and user relationship analysis:
lots of information is publicly available
spammers have specific social features (they tend to be isolated)
user behaviour towards spammers can be analysed automatically



Features - User-based

For each user:
# posted videos
# friends
# watched videos
# favourite videos
# video responses
# responded videos
# subscriptions
# subscribers



Features - Video-based

2 categories per user:
All posted videos
Video responses only
7 attributes for each category:
# views
duration
# votes
# comments
# favourites
# YouTube honours
# external links
Total and average of each attribute in each category, so 28 features in total.
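The count of 28 follows from 2 categories × 7 attributes × 2 aggregates. A small sketch of that aggregation, assuming each video is a dictionary with the seven attributes plus an `is_response` flag; the key names and the sample values are made up for illustration.

```python
ATTRIBUTES = (
    "views", "duration", "votes", "comments",
    "favourites", "honours", "external_links",
)


def video_features(videos):
    """Total and average of each attribute, over all videos and over
    video responses only: 2 * 7 * 2 = 28 features."""
    groups = {
        "all": videos,
        "responses": [v for v in videos if v["is_response"]],
    }
    features = {}
    for group_name, group in groups.items():
        for attr in ATTRIBUTES:
            total = sum(v[attr] for v in group)
            features[f"{group_name}_{attr}_total"] = total
            # assumption: an empty group averages to 0
            features[f"{group_name}_{attr}_avg"] = total / len(group) if group else 0.0
    return features


# Hypothetical videos of a single user.
user_videos = [
    {"is_response": False, "views": 100, "duration": 60, "votes": 10,
     "comments": 4, "favourites": 2, "honours": 0, "external_links": 1},
    {"is_response": True, "views": 20, "duration": 30, "votes": 0,
     "comments": 0, "favourites": 0, "honours": 0, "external_links": 0},
]

features = video_features(user_videos)
```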



Features - Social network

Based on the video-response user graph:
directed graph (X, Y)
each user is a node in the graph
(x1, x2) is a directed edge from x1 ∈ X to x2 ∈ Y if x1 responded to
a video of x2
Analysis:
in/out degree of each user
assortativity: degree(n) / avg(degree(neighbours(n)))
UserRank: depends on the quantity and quality of incoming links
clustering coefficient, betweenness, reciprocity
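A sketch of some of these graph measures in plain Python. It assumes that degree(n) means in-degree plus out-degree, that neighbours are taken in both directions, and that reciprocity is the share of users n responded to who also responded to n; the slides do not fix these details, and the edge list is made up.

```python
def build_graph(edges):
    """edges: (responder, responded) pairs. Returns out- and in-neighbour maps."""
    out_nb, in_nb = {}, {}
    for src, dst in edges:
        out_nb.setdefault(src, set()).add(dst)
        in_nb.setdefault(dst, set()).add(src)
    return out_nb, in_nb


def degree(n, out_nb, in_nb):
    return len(out_nb.get(n, ())) + len(in_nb.get(n, ()))


def neighbours(n, out_nb, in_nb):
    return out_nb.get(n, set()) | in_nb.get(n, set())


def assortativity(n, out_nb, in_nb):
    """degree(n) divided by the average degree of n's neighbours."""
    nbs = neighbours(n, out_nb, in_nb)
    if not nbs:
        return 0.0  # assumption: isolated nodes score 0
    avg = sum(degree(m, out_nb, in_nb) for m in nbs) / len(nbs)
    return degree(n, out_nb, in_nb) / avg


def reciprocity(n, out_nb, in_nb):
    """Share of the users n responded to who also responded to n."""
    targets = out_nb.get(n, set())
    if not targets:
        return 0.0
    return len(targets & in_nb.get(n, set())) / len(targets)


# Hypothetical graph: "s" responds to everyone, nobody responds back.
edges = [("s", "a"), ("s", "b"), ("s", "c"), ("a", "b"), ("b", "a")]
out_nb, in_nb = build_graph(edges)
s_assortativity = assortativity("s", out_nb, in_nb)
s_reciprocity = reciprocity("s", out_nb, in_nb)
```

In the toy graph "s" has out-degree 3, in-degree 0 and reciprocity 0, the kind of one-directional profile the slides describe for spammers.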



Classification - Dataset

Data crawling:
Starting from the top-100 most responded videos, retrieve the connected data
about video responses, responded videos and users.

Manual classification:
Each user with at least one video response unrelated to the responded video
is classified as a spammer.

Test set:
473 legitimate users + 119 spammers = 592 users



Classification - Training

Support Vector Machine

5-fold cross-validation

Adopted features:
user-based
video-based
social-network
all together
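The 5-fold cross-validation protocol can be sketched as an index split: the users are shuffled, divided into five folds, and each fold serves once as the test set while the other four are used for training. The shuffling seed and the function name are assumptions, and the SVM training itself is left out.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# 592 users, as in the dataset described above.
splits = list(k_fold_indices(592, k=5))
```

Each user lands in exactly one test fold, so every classification result is obtained on data the model did not see during training.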



Classification - Results

Measure    User   Video  SN     All
TP         0.054  0.426  0.375  0.439
TN         0.998  0.922  1      0.981
FP         0.002  0.078  0      0.019
FN         0.946  0.574  0.625  0.561
Accuracy   0.821  0.821  0.874  0.870
F          0.094  0.484  0.540  0.558

TP = users correctly classified as spammers
FP = legitimate users incorrectly classified as spammers
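A short sketch of how accuracy relates to the other rows, assuming TP and TN are per-class rates (fraction of spammers and of legitimate users classified correctly) and using the class sizes from the dataset slide (119 spammers, 473 legitimate users). The function name is illustrative.

```python
def accuracy_from_rates(tp_rate, tn_rate, n_spammers, n_legitimate):
    """Overall accuracy from per-class rates and class sizes."""
    correct = tp_rate * n_spammers + tn_rate * n_legitimate
    return correct / (n_spammers + n_legitimate)


# SN column: TP = 0.375, TN = 1.
sn_accuracy = accuracy_from_rates(0.375, 1.0, 119, 473)
```

For the SN column this gives about 0.874, in line with the table. The other columns need not match exactly under this reading, for example because of rounding or averaging over folds.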



Conclusions



Conclusions

Classifications
Tag-spam recognition:
Accuracy > 98%
False positives < 2%
YouTube video spam recognition:
True positives ≈ 44%
False positives < 2%



Conclusions

Pros:
Few legitimate users classified as spammers
Tag-spam recognition finds most of the spammers
Datasets built from publicly available information

Cons:
Social systems are already poisoned by spam
Manual classification of training examples



References I

Brian D. Davison,
The Potential for Research and Development in Adversarial
Information Retrieval,
Computer Science and Engr., Lehigh University, Cambridge, 2009,
available at http://airweb.cse.lehigh.edu/2009/slides/Davison-AIRWeb2009-Keynote.pdf.

B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme,
Evaluating similarity measures for emergent semantics of social
tagging,
In Proc. 18th Intl. WWW Conf., 2009,
available at http://www2009.org/proceedings/pdf/p641.pdf.

Benjamin Markines, Ciro Cattuto, Filippo Menczer,
Social Spam Detection, AIRWeb ’09, April 21, 2009, Madrid, Spain,
available at http://airweb.cse.lehigh.edu/2009/papers/p41-markines.pdf.



References II

Fabricio Benevenuto, Tiago Rodrigues, Virgilio Almeida, Jussara Almeida,
Chao Zhang, Keith Ross,
Identifying Video Spammers in Online Social Networks,
AIRWeb ’08, April 22, 2008, Beijing, China,
available at http://airweb.cse.lehigh.edu/2008/submissions/benevenuto_2008_spam_video.pdf.

