Introduction          Tag-spam detection           Youtube Video Spamming     Conclusions              References




                           Adversarial IR - Social Spamming

                                                 Nicola Miotto

                                             Unipd - Computer Science


                                              January 22, 2011




Nicola Miotto (Unipd - Computer Science)     Adversarial IR - Social Spamming         January 22, 2011        1 / 39



  Outline

   1   Introduction
          Spam
          Adversarial IR
   2   Tag-spam detection in Social Bookmarking systems
         Problem description
         Features
         Classification
   3   Youtube Video Spamming
         Problem description
         Features
         Classification
   4   Conclusions
   5   References





                                             Introduction







  Spam - History


          1970: the BBC broadcasts the Spam sketch by Monty Python’s Flying
          Circus, from which the current meaning of the term is derived;
          1978: an advertising message is sent to 393 ARPANET users, the earliest
          documented spam;
          1990s: "Make Money Fast" messages flood many newsgroups. First
          association of the term spam with an IT-related field;
          1998: a new definition of the term spam appears in the New Oxford
          Dictionary of English:
  Definition
  Irrelevant or inappropriate messages sent on the Internet to a large number
  of newsgroups or users.





  Spam - Fields



          E-mail

          Instant messaging: Messaging spam

          Web-Search: Spamdexing

          Social systems: Social spam

          And so on...







  Spam - Spammer


  Earn money on the web!
  Services such as Google AdSense or Heyos let users place automatically
  generated ads on their web pages in order to earn money from
  clicks and page impressions.


  Legal advertiser:
          Builds websites that host content-related ads;

          Improves the PageRank of the website for the relevant keywords;

          Tries to lead potential customers to his websites;






  Spam - Spammer


  Spammer:
          Website content is used only to attract users and improve the PageRank;

          No distinction between interested and uninterested users;
          Automatic spam-network generation programs:
                 they find the relevant keywords (e.g., via AdWords);
                 they register domain names containing those keywords;
                 they create complete websites with fake content built around the
                 keywords found;
                 they link the generated websites together in order to improve the
                 PageRank;






  Spam - Social Spamming




  Spam campaigns directed at social network users
          Social bookmarking systems: Delicious;
          Video social networks: YouTube;
          General-purpose social networks: Facebook;
          and so on...







  Spam - Social Spamming




  Features:
          Lots of user-related information;
          Easier to target a specific demographic segment;
          Cheaper (usually);
          Adopted solution (most of the time): Report abuse
          → generic solution, but less effective than ad-hoc ones.







  Spam - Consequences




          Users hijacked towards content outside their information needs;


          Unfair competition with legal advertisers;


          Information poisoning due to spam noise







  Adversarial IR - Definition




  Adversarial: “Assumes competing parties trying to affect the outcome of
  a system (system could be an algorithm, a market, etc)”




  Adversarial IR: “Information retrieval, ranking, or classification system
  affected by multiple parties acting in their own interest”







  Adversarial IR - AIRWeb


  AIRWeb
  Adversarial Information Retrieval on the Web


          Annual workshop about Adversarial IR


          Researchers and industry practitioners gather to present and
          discuss advances in the state of the art of Adversarial IR


          First workshop in 2005 (Japan)






  Discussed techniques


  AIRWeb papers
            Social spam recognition techniques discussed during the
            AIRWeb workshops

  Supervised Machine Learning
                           1   Feature modelling

                           2   Training dataset retrieval

                           3   Machine learning (e.g., SVM)

                           4   Result evaluation







   Tag-spam detection in Social Bookmarking
                   systems







  Problem description - Tag-spam

  Social bookmarking system:
          Users can associate meta-information (tags) with resources (links);
          One or more words can be associated with any resource;


  Advertiser:
          Social tagging: posts links to his website, tagging them with
          content-related keywords


  Spammer:
          The most popular keywords (e.g., music) are used to tag unrelated websites
          (e.g., his spam websites);






                                Figure: Delicious.com Screenshot (2011)






                       Figure: Example: Tag-spam on Delicious.com (2008)







  Problem description - Folksonomy


          Data structure to represent a social tagging system;

          Hyper-graph connecting users, resources and tags;

          Symbols:
                 u ∈ U, U set of users;
                 r ∈ R, R set of resources;
                 t ∈ T , T set of tags;

          post= {(u, r , t1 ), ..., (u, r , tn )} = {(u, r , (t1 , ..., tn ))}

          F = {post1 , ..., postn }
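
  A minimal sketch of how such a folksonomy could be held in memory, assuming
  triples are stored flat and grouped into posts on demand. The users, URLs,
  tags and the helper name `posts` are made up for illustration and do not
  come from the slides.

```python
from collections import defaultdict

# A folksonomy F stored as a set of (user, resource, tag) triples.
# All values below are invented illustration data.
F = {
    ("alice", "http://example.org/jazz", "music"),
    ("alice", "http://example.org/jazz", "jazz"),
    ("bob", "http://example.org/jazz", "music"),
    ("mallory", "http://example.com/ads", "music"),
}


def posts(folksonomy):
    """Group triples into posts: (u, r) -> set of tags used by u on r."""
    grouped = defaultdict(set)
    for user, resource, tag in folksonomy:
        grouped[(user, resource)].add(tag)
    return dict(grouped)


all_posts = posts(F)
```

  Here `all_posts` has one entry per (user, resource) pair, which mirrors the
  post = (u, r, (t1, ..., tn)) notation used on the slide.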







                       Figure: Folksonomy graphical representation example





  Features - Tag based


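                                     Which tags do spammers use?

      TagSpam                  $U_t = \{u : \exists r : (u, r, t) \in F\}$, the users who used tag $t$
                               $S_t \subseteq U_t$, the users of $t$ identified as spammers
                               $\Pr(t) = \frac{|S_t|}{|U_t|}$

                               $T(u, r) = \{t : (u, r, t) \in F\}$

                               $$f_{\mathrm{TagSpam}}(u, r) = \frac{1}{|T(u, r)|} \sum_{t \in T(u, r)} \Pr(t)$$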
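
  The TagSpam feature can be sketched in a few lines, assuming the folksonomy
  is a set of (user, resource, tag) triples and the labelled spammers are
  given as a set of user names. The function names, the sample data and the
  choice to score a tagless post as 0 are assumptions for illustration.

```python
def tag_spam_probability(folksonomy, spammers):
    """Pr(t) = |S_t| / |U_t| for every tag t in the folksonomy."""
    users_by_tag = {}
    for user, _resource, tag in folksonomy:
        users_by_tag.setdefault(tag, set()).add(user)
    return {
        tag: len(users & spammers) / len(users)
        for tag, users in users_by_tag.items()
    }


def f_tag_spam(folksonomy, spammers, user, resource):
    """Average Pr(t) over the tags T(u, r) of one post."""
    pr = tag_spam_probability(folksonomy, spammers)
    tags = {t for (u, r, t) in folksonomy if u == user and r == resource}
    if not tags:
        return 0.0  # assumption: a post without tags scores 0
    return sum(pr[t] for t in tags) / len(tags)


# Invented illustration data: "mallory" is the only labelled spammer.
F = {
    ("alice", "r1", "music"),
    ("bob", "r1", "music"),
    ("mallory", "r2", "music"),
    ("mallory", "r2", "cheap-loans"),
}
spam_score = f_tag_spam(F, {"mallory"}, "mallory", "r2")
legit_score = f_tag_spam(F, {"mallory"}, "alice", "r1")
```

  In this toy data "music" is used by three users, one of them a spammer, and
  "cheap-loans" only by the spammer, so the spammer's post scores higher than
  the legitimate one.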





  Features - Tag based



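                      Is there a semantic relationship between tags?

         TagBlur               $\sigma(t_1, t_2) \in [0, 1]$, normalized tag similarity between $t_1$ and $t_2$
                               $Z$ = number of tag pairs in $T(u, r)$
                               $\epsilon$ = small positive constant that keeps the first term finite when $\sigma = 0$

                               $$f_{\mathrm{TagBlur}}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u, r)} \left[ \frac{1}{\sigma(t_1, t_2) + \epsilon} - \frac{1}{1 + \epsilon} \right]$$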
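
  A rough sketch of the TagBlur computation, assuming a symmetric similarity
  function with values in [0, 1], unordered tag pairs, and a small smoothing
  constant (called `epsilon` here) in both denominators. The similarity table,
  the default value 0.01 and all names are hypothetical.

```python
from itertools import combinations


def f_tag_blur(tags, similarity, epsilon=0.01):
    """Average 'distance' between the distinct tag pairs of one post.

    Each pair contributes 1/(sigma + epsilon) - 1/(1 + epsilon), which is 0
    for identical meaning (sigma = 1) and large for unrelated tags.
    """
    pairs = list(combinations(sorted(set(tags)), 2))
    if not pairs:
        return 0.0  # assumption: a single-tag post is not blurred at all
    total = sum(
        1.0 / (similarity(t1, t2) + epsilon) - 1.0 / (1.0 + epsilon)
        for t1, t2 in pairs
    )
    return total / len(pairs)  # len(pairs) plays the role of Z


# Hypothetical similarity table; unknown pairs count as unrelated (0.0).
TOY_SIMILARITY = {frozenset(("jazz", "music")): 0.9}


def toy_similarity(t1, t2):
    return TOY_SIMILARITY.get(frozenset((t1, t2)), 0.0)


coherent = f_tag_blur(["jazz", "music"], toy_similarity)
blurred = f_tag_blur(["jazz", "cheap-loans"], toy_similarity)
```

  A post whose tags are semantically close gets a score near 0, while a post
  mixing unrelated tags gets a much larger one.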







  Features - Resource based I


         DomFP                 Spammers use programs to generate pages → spam
                               pages share the same content
                               The fingerprints of some spam pages are known
                               Compute the likelihood that r is spam by comparing the
                               fingerprint of r to the known ones

       NumAds                  Spammers usually just offer lots of ads
                               NumAds application example: count the occurrences of
                               googlesyndication.com in the resource's HTML
                               code
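
  The NumAds counting example could look like the sketch below, assuming the
  HTML of the resource has already been downloaded into a string. The function
  name and the sample page are invented for illustration.

```python
def num_ads(html, marker="googlesyndication.com"):
    """Count how often the ad-serving domain appears in the page source."""
    return html.lower().count(marker)


# Invented sample page with two ad script references.
sample_page = """
<html><body>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script>
<p>Some text</p>
<script src="http://pagead2.GoogleSyndication.com/pagead/show_ads.js"></script>
</body></html>
"""
ads = num_ads(sample_page)
```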






  Features - Resource based



    Plagiarism                 Spammers usually copy content from high-ranked
                               websites
                               Compare the content of r to other web pages

     ValidLinks                Spammer websites are frequently taken down
                               Many invalid links posted by u imply a greater
                               likelihood that u is a spammer







  Classification - Training dataset


  BibSonomy.org :
                               public dataset

                               27,000 users and their posts

                               hand-made classification → 25,000 spammers and 2,000
                               legal users


  Classification :
                               Binary classification into either spammer or not
                               spammer






  Classification - Results



                                           SVM                                   AdaBoost
            Features               Accuracy FP                   F1        Accuracy FP            F1
            TagSpam                 95.82% .061                 .957        94.66% .048          .943
            + TagBlur               96.75% .048                 .966        96.06% .044          .958
            + DomFp                 96.75% .048                 .966        96.06% .044          .958
            + ValidLinks            96.52% .048                 .964        96.75% .026          .965
            + NumAds                96.52% .048                 .964        97.22% .026          .970
            + Plagiarism            96.75% .048                 .966        98.38% 0.22          .983








                         YouTube Video Spamming







  Description - Youtube video spam



          Video response: a user answers a video with another, related video

          Spammer: a user who answers with unrelated videos
          Reasons:
                 increase video popularity
                 marketing campaign
                 pornography distribution
                 system poisoning

          Issue: automatic content-based spam recognition is hard to implement







  Description - Techniques



          Content-based recognition:
                 video content analysis
                 requires too many computational resources
                 hard to generalize the idea of spam in a video, unless it has
                 textual content

          Video and user relationship analysis:
                 lots of publicly available information
                 spammers have specific social features (they’re lonely)
                 user behaviour towards spammers can be analysed automatically







  Features - User-based


  For each user:
          # posted videos
          # friends
          # watched videos
          # favourite videos
          # video responses
          # responded videos
          # subscriptions
          # subscribers







  Features - Video-based

  2 categories of videos per user:
          All posted videos
          Just video responses
  7 attributes for each of them:
          # views
          duration
          # votes
          # comments
          # favourites
          # youtube honours
          # external links
  Total and average of each attribute for each category, so 28 features in total.
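
  The count of 28 follows from 2 categories × 7 attributes × 2 aggregates.
  A small sketch of that aggregation, assuming each video is a dictionary with
  one numeric value per attribute; the key names and the feature naming scheme
  are assumptions for illustration.

```python
# Hypothetical key names for the seven per-video attributes.
ATTRIBUTES = (
    "views", "duration", "votes", "comments",
    "favourites", "honours", "external_links",
)


def video_features(all_videos, response_videos):
    """Total and average of every attribute for both video categories."""
    features = {}
    categories = (("all", all_videos), ("responses", response_videos))
    for category, videos in categories:
        for attr in ATTRIBUTES:
            total = sum(video[attr] for video in videos)
            average = total / len(videos) if videos else 0.0
            features[f"{category}_{attr}_total"] = total
            features[f"{category}_{attr}_avg"] = average
    return features


# Invented data: two posted videos, one of which is a video response.
video_a = dict.fromkeys(ATTRIBUTES, 2)
video_b = dict.fromkeys(ATTRIBUTES, 4)
features = video_features([video_a, video_b], [video_b])
```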




  Features - Social network


  Based on the video response user graph:
          directed graph (X,Y)
          each user is a node in the graph
          (x1 , x2 ) directed edge from x1 ∈ X to x2 ∈ Y if x1 ∈ X responded to
          a video of x2 ∈ Y
  Analysis:
          in/out degree for each “user”
          assortativity: degree(n) / avg( degree(neighbours(n)) )
          userrank: depending on quantity and quality of in-links
          clustering coefficient, betweenness, reciprocity
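
  The degree-based measures can be sketched without any graph library,
  assuming the graph is given as a list of (responder, responded) pairs.
  Which degree enters the assortativity ratio is not specified on the slide;
  this sketch uses in-degree plus out-degree, which is an assumption, as are
  all names and the sample edges.

```python
from collections import defaultdict


def graph_features(edges):
    """In/out degree and the degree ratio 'assortativity' for every node."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for source, target in edges:
        successors[source].add(target)
        predecessors[target].add(source)
    nodes = set(successors) | set(predecessors)
    # Assumption: degree(n) = in-degree + out-degree.
    degree = {n: len(successors[n]) + len(predecessors[n]) for n in nodes}

    features = {}
    for n in nodes:
        neighbours = successors[n] | predecessors[n]
        avg_neighbour_degree = sum(degree[m] for m in neighbours) / len(neighbours)
        features[n] = {
            "in_degree": len(predecessors[n]),
            "out_degree": len(successors[n]),
            "assortativity": degree[n] / avg_neighbour_degree,
        }
    return features


# Invented edges: user "s" responds to many videos and nobody responds to "s".
edges = [("s", "a"), ("s", "b"), ("s", "c"), ("a", "b")]
features = graph_features(edges)
```

  In the toy graph, "s" has out-degree 3 and in-degree 0, the kind of one-sided
  pattern the slides describe as typical of a "lonely" spammer.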






  Classification - Dataset


  Data crawling:
  Starting from the top-100 most responded videos, retrieve the connected data
  concerning video responses, responded videos and users.


  Hand-made classification:
  Each user with at least one video response unrelated to the responded video
  is classified as a spammer.


  Test set:
  473 legal users + 119 spammers = 592 users






  Classification - Training



          Support Vector Machine


          5-fold cross-validation

          Adopted features:
                 user-based
                 video-based
                 social-network
                 all together







  Classification - Results


                            Measure          User          Video          SN      All
                              TP             0.054         0.426         0.375   0.439
                              TN             0.998         0.922           1     0.981
                              FP             0.002         0.078           0     0.019
                              FN             0.946         0.574         0.625   0.561
                            Accuracy         0.821         0.821         0.874   0.870
                               F             0.094         0.484         0.540   0.558


                          TP = users correctly classified as spammers
                           FP = legal users classified as spammers







                                             Conclusions







  Conclusions



  Classifications
  Tag-spam recognition :
                 Accuracy > 98%
                 False positives < 2%
  Youtube-video spam recognition :
                  True positives > 44%
                  False positives < 2%







  Conclusions


  Pros:
          Few legal users classified as spammers
          Tag-spam recognition finds most of the spammers
          Dataset built from publicly available information


  Cons:
          Social systems already poisoned by spam
          Hand-made classification of training examples







  References I

          Brian D. Davison,
          The Potential for Research and Development in Adversarial
          Information Retrieval,
          Computer Science and Engr., Lehigh University, Cambridge, 2009,
          available at
          http://airweb.cse.lehigh.edu/2009/slides/Davison-AIRWeb2009-Keynote.pdf.

          B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme,
          Evaluating similarity measures for emergent semantics of social
          tagging,
          In Proc. 18th Intl. WWW Conf., 2009,
          available at http://www2009.org/proceedings/pdf/p641.pdf.

          Benjamin Markines, Ciro Cattuto, Filippo Menczer,
          Social Spam Detection, AIRWeb ’09, April 21, 2009 Madrid, Spain,
          available at
          http://airweb.cse.lehigh.edu/2009/papers/p41-markines.pdf.




  References II




          Fabricio Benevenuto, Tiago Rodrigues, Virgilio Almeida, Jussara Almeida,
          Chao Zhang, Keith Ross,
          Identifying Video Spammers in Online Social Networks,
          AIRWeb ’08, April 22, 2008 Beijing, China,
          available at
          http://airweb.cse.lehigh.edu/2008/submissions/benevenuto_2008_spam_video.pdf.




Nicola Miotto (Unipd - Computer Science)     Adversarial IR - Social Spamming        January 22, 2011      39 / 39

Mais conteúdo relacionado

Semelhante a Adversarial ID - Social spam recognition

Online Python Resources
Online Python ResourcesOnline Python Resources
Online Python ResourcesJonathan Fine
 
Does iTunes Provide Everything You Need To Be Entertained. Anywhere. Anytime?
Does iTunes Provide Everything You Need To Be Entertained. Anywhere. Anytime?Does iTunes Provide Everything You Need To Be Entertained. Anywhere. Anytime?
Does iTunes Provide Everything You Need To Be Entertained. Anywhere. Anytime?Joe Dawson
 
Jeremy Toeman's presentation at eComm 2008
Jeremy Toeman's presentation at eComm 2008Jeremy Toeman's presentation at eComm 2008
Jeremy Toeman's presentation at eComm 2008eComm2008
 
Comments on YouTube Videos: Understanding the Role of Anonymity
Comments on YouTube Videos: Understanding the Role of AnonymityComments on YouTube Videos: Understanding the Role of Anonymity
Comments on YouTube Videos: Understanding the Role of AnonymityM. Laeeq Khan
 
End User Development in the IoT: a Semantic Approach
End User Development in the IoT: a Semantic ApproachEnd User Development in the IoT: a Semantic Approach
End User Development in the IoT: a Semantic ApproachAlberto Monge Roffarello
 
Finding media illustrating events
Finding media illustrating eventsFinding media illustrating events
Finding media illustrating eventsRaphael Troncy
 
Cognitive approach for social engineering (How to force smart people to do du...
Cognitive approach for social engineering (How to force smart people to do du...Cognitive approach for social engineering (How to force smart people to do du...
Cognitive approach for social engineering (How to force smart people to do du...Enrico Frumento
 
Enhancing Cybersecurity Readiness Through International Cooperation
Enhancing Cybersecurity Readiness Through International CooperationEnhancing Cybersecurity Readiness Through International Cooperation
Enhancing Cybersecurity Readiness Through International CooperationPositive Hack Days
 
Chiesa_ Isecom
Chiesa_ IsecomChiesa_ Isecom
Chiesa_ IsecomGoWireless
 
Social Media Technologies, part A of 2
Social Media Technologies, part A of 2Social Media Technologies, part A of 2
Social Media Technologies, part A of 2Paolo Nesi
 
Day3 youtube(1)-1
Day3 youtube(1)-1Day3 youtube(1)-1
Day3 youtube(1)-1ds2426
 
Counterfeit Products Please respond to the followingFrom the fi.docx
Counterfeit Products Please respond to the followingFrom the fi.docxCounterfeit Products Please respond to the followingFrom the fi.docx
Counterfeit Products Please respond to the followingFrom the fi.docxvictorring
 
Tutorial on Social Multimedia Computing
Tutorial on Social Multimedia ComputingTutorial on Social Multimedia Computing
Tutorial on Social Multimedia ComputingJitao Sang
 

Semelhante a Adversarial ID - Social spam recognition (20)

E05742630
E05742630E05742630
E05742630
 
G05913234
G05913234G05913234
G05913234
 
Online Python Resources
Online Python ResourcesOnline Python Resources
Online Python Resources
 
Does iTunes Provide Everything You Need To Be Entertained. Anywhere. Anytime?
Does iTunes Provide Everything You Need To Be Entertained. Anywhere. Anytime?Does iTunes Provide Everything You Need To Be Entertained. Anywhere. Anytime?
Does iTunes Provide Everything You Need To Be Entertained. Anywhere. Anytime?
 
Jeremy Toeman's presentation at eComm 2008
Jeremy Toeman's presentation at eComm 2008Jeremy Toeman's presentation at eComm 2008
Jeremy Toeman's presentation at eComm 2008
 
Day3 youtubepdf
Day3 youtubepdfDay3 youtubepdf
Day3 youtubepdf
 
L19 Social
L19 SocialL19 Social
L19 Social
 
Comments on YouTube Videos: Understanding the Role of Anonymity
Comments on YouTube Videos: Understanding the Role of AnonymityComments on YouTube Videos: Understanding the Role of Anonymity
Comments on YouTube Videos: Understanding the Role of Anonymity
 
End User Development in the IoT: a Semantic Approach
End User Development in the IoT: a Semantic ApproachEnd User Development in the IoT: a Semantic Approach
End User Development in the IoT: a Semantic Approach
 
Research
ResearchResearch
Research
 
Finding media illustrating events
Finding media illustrating eventsFinding media illustrating events
Finding media illustrating events
 
In class powerpoint
In class powerpointIn class powerpoint
In class powerpoint
 
Day3 youtube
Day3 youtubeDay3 youtube
Day3 youtube
 
Cognitive approach for social engineering (How to force smart people to do du...
Cognitive approach for social engineering (How to force smart people to do du...Cognitive approach for social engineering (How to force smart people to do du...
Cognitive approach for social engineering (How to force smart people to do du...
 
Enhancing Cybersecurity Readiness Through International Cooperation
Enhancing Cybersecurity Readiness Through International CooperationEnhancing Cybersecurity Readiness Through International Cooperation
Enhancing Cybersecurity Readiness Through International Cooperation
 
Chiesa_ Isecom
Chiesa_ IsecomChiesa_ Isecom
Chiesa_ Isecom
 
Social Media Technologies, part A of 2
Social Media Technologies, part A of 2Social Media Technologies, part A of 2
Social Media Technologies, part A of 2
 
Day3 youtube(1)-1
Day3 youtube(1)-1Day3 youtube(1)-1
Day3 youtube(1)-1
 
Counterfeit Products Please respond to the followingFrom the fi.docx
Counterfeit Products Please respond to the followingFrom the fi.docxCounterfeit Products Please respond to the followingFrom the fi.docx
Counterfeit Products Please respond to the followingFrom the fi.docx
 
Tutorial on Social Multimedia Computing
Tutorial on Social Multimedia ComputingTutorial on Social Multimedia Computing
Tutorial on Social Multimedia Computing
 

Adversarial ID - Social spam recognition

  • 1. Introduction Tag-spam detection Youtube Video Spamming Conclusions References Adversarial IR - Social Spamming Nicola Miotto Unipd - Computer Science January 22, 2011 Nicola Miotto (Unipd - Computer Science) Adversarial IR - Social Spamming January 22, 2011 1 / 39
  • 2. Introduction Tag-spam detection Youtube Video Spamming Conclusions References Outline 1 Introduction Spam Adversarial IR 2 Tag-spam detection in Social Bookmarking systems Problem description Features Classification 3 Youtube Video Spamming Problem description Features Classificatio 4 Conclusions 5 References Nicola Miotto (Unipd - Computer Science) Adversarial IR - Social Spamming January 22, 2011 2 / 39
  • 3. Introduction Tag-spam detection Youtube Video Spamming Conclusions References Introduction Nicola Miotto (Unipd - Computer Science) Adversarial IR - Social Spamming January 22, 2011 3 / 39
  • 4. Introduction Tag-spam detection Youtube Video Spamming Conclusions References Spam - History 1970: BBC broadcasts the Spam sketch by Monty Python’s Flying Circus, where the current meaning of the term is derived; 1978: advisory message sent to 393 ARPANET users, the earliest documented spam; ’90: Make Money Fast flooding around in many newsgroup. Frist association an IT related field of the term spam; 1998: new definition for the term spam in the New Oxford Dictionary of English: Definition Irrelevant or inappropriate messages sent on the Internet to a large number of newsgroups or users. Nicola Miotto (Unipd - Computer Science) Adversarial IR - Social Spamming January 22, 2011 4 / 39
  • 5. Introduction Tag-spam detection Youtube Video Spamming Conclusions References Spam - Fields E-mail Istant Messaging: Messaging spam Web-Search: Spamdexing Social systems: Social spam And so on... Nicola Miotto (Unipd - Computer Science) Adversarial IR - Social Spamming January 22, 2011 5 / 39
  • 6. Introduction Tag-spam detection Youtube Video Spamming Conclusions References Spam - Spammer Earn money on the web! Google AdSense or Heyos like services allow users to place Ad automatically generated in their web pages in order to get money from clicks and page impressions. Legal Avertiser: He produces web site where to put content-related Ad; He improves the pagerank of the website for the relevant keywork; Try to lead potential customers to his websites; Nicola Miotto (Unipd - Computer Science) Adversarial IR - Social Spamming January 22, 2011 6 / 39
  • 7. Introduction Tag-spam detection Youtube Video Spamming Conclusions References Spam - Spammer Spammer: Website contents just used to attract users and improve the pagerank; No discrimination between interested and not interested users; Authomatic spam-network generation programs: they find the relevant keywords (eg: via AdWords) they register the domain names containing those keywords; they create complete websites with fake contents with the keywords found; they link the generated websites together in order to improve the pagerank; Nicola Miotto (Unipd - Computer Science) Adversarial IR - Social Spamming January 22, 2011 7 / 39
  • 8. Spam - Social Spamming
      • Spam campaigns directed at social network users:
          • Social bookmarking systems: Delicious;
          • Video social networks: YouTube;
          • General-purpose social networks: Facebook;
          • and so on.
  • 9. Spam - Social Spamming
      • Features:
          • Lots of user-related information;
          • Easier to target a specific demographic segment;
          • Cheaper (usually);
      • Adopted solution (most of the time): report abuse → a generic solution, but less effective than ad hoc ones.
  • 10. Spam - Consequences
      • Users are hijacked towards areas outside their information needs;
      • Unfair competition with legal advertisers;
      • Information poisoning due to spam noise.
  • 11. Adversarial IR - Definition
      • Adversarial: "Assumes competing parties trying to affect the outcome of a system (system could be an algorithm, a market, etc)"
      • Adversarial IR: "Information retrieval, ranking, or classification system affected by multiple parties acting in their own interest"
  • 12. Adversarial IR - AIRWeb
      • AIRWeb: Adversarial Information Retrieval on the Web;
      • Annual workshop on Adversarial IR;
      • Researchers and industry practitioners gather to present and discuss advances in the state of the art of Adversarial IR;
      • First workshop in 2005 (Japan).
  • 13. Discussed techniques
      • AIRWeb papers: social spam recognition techniques discussed during the AIRWeb workshops;
      • Supervised machine learning:
          1. Feature modelling;
          2. Training dataset retrieval;
          3. Machine learning (e.g. SVM);
          4. Result evaluation.
  • 14. Tag-spam detection in Social Bookmarking systems
  • 15. Problem description - Tag-spam
      • Social bookmarking system:
          • Users can associate meta-information (tags) with resources (links);
          • One or more words can be associated with any resource;
      • Advertiser: social tagging, i.e. posting links to his website and tagging them with content-related keywords;
      • Spammer: the most popular keywords (e.g. music) are used to tag unrelated websites (e.g. his spam websites).
  • 16. Figure: Delicious.com screenshot (2011)
  • 17. Figure: Example of tag-spam on Delicious.com (2008)
  • 18. Problem description - Folksonomy
      • Data structure representing a social tagging system;
      • Hypergraph connecting users, resources and tags;
      • Symbols:
          • $u \in U$, $U$ the set of users;
          • $r \in R$, $R$ the set of resources;
          • $t \in T$, $T$ the set of tags;
          • $\mathrm{post} = \{(u, r, t_1), \dots, (u, r, t_n)\} = \{(u, r, (t_1, \dots, t_n))\}$;
          • $F = \{\mathrm{post}_1, \dots, \mathrm{post}_n\}$.
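A minimal sketch of how a folksonomy could be held in memory, assuming it is stored as a flat set of (user, resource, tag) triples. The user names, URLs and the helper `posts` are hypothetical and only serve as illustration.

```python
from collections import defaultdict

# Hypothetical toy folksonomy F: a set of (user, resource, tag) triples.
F = {
    ("alice", "http://example.org/jazz", "music"),
    ("alice", "http://example.org/jazz", "jazz"),
    ("bob", "http://example.org/pills", "music"),
}


def posts(folksonomy):
    """Group the triples into posts: (user, resource) -> set of tags."""
    grouped = defaultdict(set)
    for user, resource, tag in folksonomy:
        grouped[(user, resource)].add(tag)
    return dict(grouped)


P = posts(F)
for (user, resource), tags in sorted(P.items()):
    print(user, resource, sorted(tags))
```

Keeping the triples flat makes the membership test (u, r, t) ∈ F a plain set lookup, while the grouped view gives the tag set of each post.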
  • 19. Figure: Example of the graphical representation of a folksonomy
  • 20. Features - Tag based
      • Which tags do spammers use? TagSpam:
          • $U_t = \{u : \exists r : (u, r, t) \in F\}$, the users who used tag $t$;
          • $S_t \subseteq U_t$, the users of $t$ identified as spammers;
          • $\Pr(t) = \frac{|S_t|}{|U_t|}$;
          • $T(u, r) = \{t : (u, r, t) \in F\}$;
          • $f_{\mathrm{TagSpam}}(u, r) = \frac{1}{|T(u, r)|} \sum_{t \in T(u, r)} \Pr(t)$.
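The TagSpam definitions can be sketched in a few lines of Python, assuming the folksonomy is a set of (user, resource, tag) triples and the set of known spammers is given. The data and the function names are hypothetical.

```python
def tag_spam_probability(folksonomy, spammers):
    """Pr(t) = |S_t| / |U_t| for every tag t in the folksonomy."""
    users_by_tag = {}
    for user, _resource, tag in folksonomy:
        users_by_tag.setdefault(tag, set()).add(user)
    return {
        tag: len(users & spammers) / len(users)
        for tag, users in users_by_tag.items()
    }


def f_tag_spam(folksonomy, pr, user, resource):
    """Average Pr(t) over the tags T(u, r) of one post."""
    tags = {t for (u, r, t) in folksonomy if u == user and r == resource}
    if not tags:
        return 0.0
    return sum(pr.get(t, 0.0) for t in tags) / len(tags)


# Hypothetical toy data: bob is the only known spammer.
F = {
    ("alice", "jazz-site", "music"),
    ("alice", "jazz-site", "jazz"),
    ("bob", "pills-site", "music"),
    ("bob", "pills-site", "cheap"),
    ("carol", "radio-site", "music"),
}
pr = tag_spam_probability(F, spammers={"bob"})
score_bob = f_tag_spam(F, pr, "bob", "pills-site")
score_alice = f_tag_spam(F, pr, "alice", "jazz-site")
```

In this toy data "music" is used by three users, one of them a spammer, so Pr(music) = 1/3. The post by bob averages 1/3 and 1, the post by alice averages 1/3 and 0.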
  • 21. Features - Tag based
      • Is there a semantic relationship between the tags? TagBlur:
          • $\sigma(t_1, t_2) \in [0, 1]$, the normalized similarity between tags $t_1$ and $t_2$;
          • $Z$ = number of tag pairs in $T(u, r)$;
          • $\epsilon$ = small positive constant (keeps the term finite when $\sigma = 0$);
          • $f_{\mathrm{TagBlur}}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u, r)} \left[ \frac{1}{\sigma(t_1, t_2) + \epsilon} - \frac{1}{1 + \epsilon} \right]$.
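A hedged sketch of the TagBlur score, assuming a symmetric similarity function and a small constant `eps` that keeps the term finite when the similarity is 0. The lookup-table similarity below is a hypothetical stand-in, not the similarity measure used by the authors.

```python
from itertools import combinations


def f_tag_blur(tags, similarity, eps=1e-3):
    """Average of 1/(sigma + eps) - 1/(1 + eps) over all tag pairs of a post."""
    pairs = list(combinations(sorted(tags), 2))
    if not pairs:  # a post with a single tag has no pairs to compare
        return 0.0
    total = sum(
        1.0 / (similarity(a, b) + eps) - 1.0 / (1.0 + eps) for a, b in pairs
    )
    return total / len(pairs)


# Hypothetical similarity table; unknown pairs count as unrelated (0.0).
TABLE = {frozenset(("jazz", "music")): 0.9}


def toy_similarity(a, b):
    return TABLE.get(frozenset((a, b)), 0.0)


focused = f_tag_blur({"jazz", "music"}, toy_similarity)
blurred = f_tag_blur({"music", "pills", "loans"}, toy_similarity)
```

A post whose tags are closely related scores near 0, while a post mixing unrelated tags scores high, which is the behaviour the feature relies on.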
  • 22. Features - Resource based I
      • DomFP:
          • Spammers use programs to generate pages → spam pages share the same content;
          • The fingerprints of some spam pages are known;
          • Compute the likelihood that r is spam by comparing its fingerprint to the known ones;
      • NumAds:
          • Spammers' pages usually just offer lots of ads;
          • Example: count the occurrences of googlesyndication.com in the HTML code of the resource.
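The NumAds example can be sketched as a plain substring count over the page source. This is only an illustration: the function name and the sample HTML are hypothetical, and a real crawler would first have to download the resource.

```python
def num_ads(html, marker="googlesyndication.com"):
    """Count how many times the ad-server host name appears in the HTML."""
    return html.lower().count(marker)


# Hypothetical page source with two ad snippets.
sample_html = """
<html><body>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script>
<p>Some filler text</p>
<script src="http://pagead2.GoogleSyndication.com/pagead/show_ads.js"></script>
</body></html>
"""
ads = num_ads(sample_html)
```

Lower-casing the source first makes the count insensitive to the capitalisation of the host name.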
  • 23. Features - Resource based
      • Plagiarism:
          • Spammers usually copy content from high-ranked websites;
          • Compare the contents of r to other web pages;
      • ValidLinks:
          • Spammer websites are frequently taken down;
          • Lots of invalid links posted by u imply a greater likelihood that u is a spammer.
  • 24. Classification - Training dataset
      • BibSonomy.org: public dataset;
          • 27,000 users and their posts;
          • hand-made classification → 25,000 spammers and 2,000 legal users;
      • Classification: binary classification into spammer or non-spammer.
  • 25. Classification - Results
                        SVM                        AdaBoost
      Features          Accuracy  FP     F1        Accuracy  FP     F1
      TagSpam           95.82%    .061   .957      94.66%    .048   .943
      + TagBlur         96.75%    .048   .966      96.06%    .044   .958
      + DomFp           96.75%    .048   .966      96.06%    .044   .958
      + ValidLinks      96.52%    .048   .964      96.75%    .026   .965
      + NumAds          96.52%    .048   .964      97.22%    .026   .970
      + Plagiarism      96.75%    .048   .966      98.38%    0.22   .983
  • 26. YouTube Video Spamming
  • 27. Description - YouTube video spam
      • Video response: a user answers a video with another, related video;
      • Spammer: a user answering with unrelated videos;
      • Reasons:
          • increasing video popularity;
          • marketing campaigns;
          • pornography distribution;
          • system poisoning;
      • Issue: automatic content-based spam recognition is hard to implement.
  • 28. Description - Techniques
      • Content-based recognition:
          • video content analysis requires too many computational resources;
          • it is hard to generalize the idea of spam in a video, unless the video contains textual content;
      • Video and user relationship analysis:
          • lots of information is publicly available;
          • spammers have specific social features (they're lonely);
          • user behaviour towards spammers can be analysed automatically.
  • 29. Features - User-based
      • For each user:
          • # posted videos;
          • # friends;
          • # watched videos;
          • # favourite videos;
          • # video responses;
          • # responded videos;
          • # subscriptions;
          • # subscribers.
  • 30. Features - Video-based
      • 2 categories per user:
          • all posted videos;
          • video responses only;
      • 7 attributes for each of them:
          • # views;
          • duration;
          • # votes;
          • # comments;
          • # favourites;
          • # YouTube honours;
          • # external links;
      • Total and average of each attribute, so 2 × 7 × 2 = 28 features in total.
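A small sketch of how the 28 video-based features could be assembled, assuming each video is a dictionary holding the seven attributes. The attribute keys and the function name are hypothetical.

```python
# Hypothetical attribute keys standing in for the seven attributes per video.
ATTRIBUTES = (
    "views", "duration", "votes", "comments",
    "favourites", "honours", "external_links",
)


def video_features(all_videos, responses):
    """Total and average of each attribute, for both video categories."""
    features = {}
    for category, videos in (("all", all_videos), ("responses", responses)):
        for attr in ATTRIBUTES:
            total = sum(video[attr] for video in videos)
            features[f"{category}_{attr}_total"] = total
            # Users with no video in a category get an average of 0.
            features[f"{category}_{attr}_avg"] = (
                total / len(videos) if videos else 0.0
            )
    return features


# Hypothetical user with two posted videos, one of which is a video response.
v1 = dict(views=100, duration=60, votes=4, comments=2,
          favourites=1, honours=0, external_links=3)
v2 = dict(views=300, duration=30, votes=0, comments=0,
          favourites=0, honours=0, external_links=9)
feats = video_features([v1, v2], [v2])
```

The resulting dictionary has 2 categories × 7 attributes × 2 aggregates = 28 entries per user.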
  • 31. Features - Social network
      • Based on the video response user graph:
          • directed graph (X, Y);
          • each user is a node in X;
          • (x1, x2) ∈ Y is a directed edge from x1 ∈ X to x2 ∈ X if x1 responded to a video of x2;
      • Analysis:
          • in/out degree of each user;
          • assortativity: degree(n) / avg(degree(neighbours(n)));
          • UserRank: depends on the quantity and quality of incoming links;
          • clustering coefficient, betweenness, reciprocity.
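Some of the graph measures can be sketched with plain dictionaries. The sketch assumes that "degree" in the assortativity ratio means in-degree plus out-degree and that reciprocity is the fraction of a user's outgoing edges that are returned; both are assumptions, and the edge list is hypothetical.

```python
from collections import defaultdict


def graph_features(edges):
    """Per-user in/out degree, assortativity ratio and reciprocity."""
    out_nb, in_nb = defaultdict(set), defaultdict(set)
    for src, dst in edges:  # src responded to a video posted by dst
        out_nb[src].add(dst)
        in_nb[dst].add(src)
    nodes = set(out_nb) | set(in_nb)
    # Assumption: total degree = out-degree + in-degree.
    degree = {
        n: len(out_nb.get(n, ())) + len(in_nb.get(n, ())) for n in nodes
    }
    features = {}
    for n in nodes:
        outs, ins = out_nb.get(n, set()), in_nb.get(n, set())
        neighbours = outs | ins
        avg_nb = sum(degree[m] for m in neighbours) / len(neighbours)
        features[n] = {
            "in_degree": len(ins),
            "out_degree": len(outs),
            "assortativity": degree[n] / avg_nb,
            # Assumption: share of outgoing edges that are reciprocated.
            "reciprocity": len(outs & ins) / len(outs) if outs else 0.0,
        }
    return features


# Hypothetical graph: "spam" responds to everybody, nobody responds back.
edges = [("spam", "a"), ("spam", "b"), ("spam", "c"), ("a", "b"), ("b", "a")]
feats = graph_features(edges)
```

In this toy graph the "spam" node has out-degree 3, in-degree 0 and reciprocity 0, which matches the intuition that spammers are lonely.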
  • 32. Classification - Dataset
      • Data crawling: starting from the top-100 most responded videos, retrieve the connected data about video responses, responded videos and users;
      • Hand-made classification: each user with at least one video response unrelated to the responded video is classified as a spammer;
      • Test set: 473 legal users + 119 spammers = 592 users.
  • 33. Classification - Training
      • Support Vector Machine;
      • 5-fold cross-validation;
      • Adopted features:
          • user-based;
          • video-based;
          • social-network;
          • all together.
  • 34. Classification - Results
      Measure    User   Video  SN     All
      TP         0.054  0.426  0.375  0.439
      TN         0.998  0.922  1      0.981
      FP         0.002  0.078  0      0.019
      FN         0.946  0.574  0.625  0.561
      Accuracy   0.821  0.821  0.874  0.870
      F          0.094  0.484  0.540  0.558
      • TP = users correctly classified as spammers;
      • FP = legal users classified as spammers.
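As a rough cross-check, accuracy and F-measure can be derived from the TP and FP rates together with the class sizes of the test set (119 spammers, 473 legal users). The sketch below pools all users into a single confusion matrix, which is an assumption: the reported figures may have been averaged across the cross-validation folds, so small differences are to be expected.

```python
def metrics_from_rates(tp_rate, fp_rate, n_spammers, n_legal):
    """Accuracy, precision and F1 from per-class rates and class sizes."""
    tp = tp_rate * n_spammers   # spammers correctly flagged
    fp = fp_rate * n_legal      # legal users wrongly flagged
    tn = n_legal - fp
    accuracy = (tp + tn) / (n_spammers + n_legal)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp_rate
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall else 0.0
    )
    return accuracy, precision, f1


# SN column: TP rate 0.375, FP rate 0.
accuracy, precision, f1 = metrics_from_rates(0.375, 0.0, 119, 473)
print(round(accuracy, 3), round(precision, 3), round(f1, 3))
```

For the SN column the pooled accuracy comes out at about 0.874, in line with the reported value, while the pooled F-measure (about 0.545) is close to, but not identical with, the reported 0.540.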
  • 35. Conclusions
  • 36. Conclusions - Classifications
      • Tag-spam recognition:
          • accuracy > 98%;
          • false positives < 2%;
      • YouTube video spam recognition:
          • true positives about 44%;
          • false positives < 2%.
  • 37. Conclusions
      • Pros:
          • few legal users are classified as spammers;
          • tag-spam recognition finds most of the spammers;
          • datasets are built from publicly available information;
      • Cons:
          • the social systems are already poisoned by spam;
          • the training examples are classified by hand.
  • 38. References I
      • Brian D. Davison, The Potential for Research and Development in Adversarial Information Retrieval, Computer Science and Engr., Lehigh University, Cambridge, 2009, available at http://airweb.cse.lehigh.edu/2009/slides/Davison-AIRWeb2009-Keynote.pdf.
      • B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme, Evaluating similarity measures for emergent semantics of social tagging, In Proc. 18th Intl. WWW Conf., 2009, available at http://www2009.org/proceedings/pdf/p641.pdf.
      • Benjamin Markines, Ciro Cattuto, Filippo Menczer, Social Spam Detection, AIRWeb '09, April 21, 2009, Madrid, Spain, available at http://airweb.cse.lehigh.edu/2009/papers/p41-markines.pdf.
  • 39. References II
      • Fabricio Benevenuto, Tiago Rodrigues, Virgilio Almeida, Jussara Almeida, Chao Zhang, Keith Ross, Identifying Video Spammers in Online Social Networks, AIRWeb '08, April 22, 2008, Beijing, China, available at http://airweb.cse.lehigh.edu/2008/submissions/benevenuto_2008_spam_video.pdf.