Improving the Quality of Web Spam Filtering by Using Seed Refinement
  Master Thesis Defense, December 16, 2010
  Presenter: Qureshi, Muhammad Atif
  Advisor: Whang, Kyu-Young
  Database and Multimedia Lab, Korea Advanced Institute of Science and Technology (KAIST)
Contents (Jan 7, 2011)
Introduction
Web Search Engine
Web Page Ranking
Link Structure of Web [GGP04]
  Fig. 1: An example of a web graph.
  V = {A, B, C}
  E = {AB, BC}
  AB is an outlink of the web node A.
  BC is an outlink of the web node B.
  AB is an inlink of the web node B.
  BC is an inlink of the web node C.
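The inlink and outlink definitions of Fig. 1 can be sketched in a few lines of Python. This is only an illustration; the names `edges`, `outlinks`, and `inlinks` are hypothetical and do not come from the thesis.

```python
from collections import defaultdict

# Web graph from Fig. 1: V = {A, B, C}, E = {AB, BC}.
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C")]

outlinks = defaultdict(list)  # node -> edges leaving the node
inlinks = defaultdict(list)   # node -> edges entering the node
for src, dst in edges:
    outlinks[src].append(src + dst)
    inlinks[dst].append(src + dst)

for n in nodes:
    print(n, "outlinks:", outlinks[n], "inlinks:", inlinks[n])
```

Node B is the only node with both an inlink (AB) and an outlink (BC), which matches the four statements listed for Fig. 1.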
Web Page Ranking by Using the Link-based Methods
  PR[p]: PageRank value of the web node p
  N_outlink(q): the number of outlinks of the web node q
  d: damping factor (probability of following an outlink)
  v[p]: the probability of a random jump from the web node p to any arbitrary web node
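The slide's formula is not preserved in this transcript. The symbols above match the standard PageRank recurrence, PR[p] = d * sum over inlinks (q, p) of PR[q] / N_outlink(q) + (1 - d) * v[p], so the sketch below assumes that form. The uniform choice of v and the function name `pagerank` are assumptions, and nodes without outlinks simply lose their score mass in this simplified version.

```python
def pagerank(nodes, edges, d=0.85, iterations=50):
    """Iterate PR[p] = d * sum(PR[q] / N_outlink(q)) + (1 - d) * v[p]."""
    v = {p: 1.0 / len(nodes) for p in nodes}  # uniform random jump (assumption)
    n_outlink = {q: 0 for q in nodes}
    for q, _ in edges:
        n_outlink[q] += 1
    pr = dict(v)  # start from the random-jump distribution
    for _ in range(iterations):
        nxt = {p: (1 - d) * v[p] for p in nodes}
        for q, p in edges:
            nxt[p] += d * pr[q] / n_outlink[q]
        pr = nxt
    return pr

# Web graph of Fig. 1: A -> B -> C.
pr = pagerank(["A", "B", "C"], [("A", "B"), ("B", "C")])
print(pr)
```

On the graph of Fig. 1, the score grows along the chain: C ranks above B, and B above A, because each node inherits a damped share of its predecessor's score.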
Web Spam [HMS02, GG05]
  Fig. 2: An example of link spam.
  An actor who wants to boost the rank of the web node N3 creates the web nodes N3 to Nx.
  The web nodes N1 and N2 are not involved in link spam, so they are called non-spam nodes.
  The web nodes N3 to Nx are involved in link spam, so they are called spam nodes.
Web Spam Filtering Algorithm
Motivation and Goal
Contributions
Related Work
  Note: Existing work exploits a web graph in which each web node represents a domain [GBG06, WD05].
TrustRank
  Fig. 3: An example for explaining TrustRank, with domains 1 to 4.
  t(i): the trust score of domain i. In the figure, t(1) = 1, t(2) = 1, t(3) = 5/6, and t(4) = 1/3.
  Edge labels in the figure: 1/2, 1/2, 1/3, 1/3, 1/3, 5/12, 5/12.
  The domain 3 gets trust scores from the domains 1 and 2.
  Legend: a seed non-spam domain; a domain being considered.
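The numbers in Fig. 3 are consistent with each domain splitting its trust score equally among its outlinks. The sketch below reproduces that arithmetic with exact fractions. The out-degrees are read off the edge labels and are an assumption, and the damping used by the full TrustRank algorithm is left out.

```python
from fractions import Fraction

# Seed non-spam domains start with trust 1, as in Fig. 3.
t = {1: Fraction(1), 2: Fraction(1)}
# Out-degrees inferred from the edge labels 1/2, 1/3, and 5/12 (assumption).
out_degree = {1: 2, 2: 3, 3: 2}

def share(i):
    """Trust that domain i passes along each of its outlinks."""
    return t[i] / out_degree[i]

t[3] = share(1) + share(2)  # 1/2 + 1/3
t[4] = share(2)             # a single inlink from domain 2
print(t[3], share(3), t[4])  # prints: 5/6 5/12 1/3
```

Domain 3 therefore forwards 5/12 along each of its two outlinks, which is the label shown on those edges.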
Anti-TrustRank
  Fig. 4: An example for explaining Anti-TrustRank, with domains 1 to 4.
  at(i): the anti-trust score of domain i. In the figure, at(1) = 1, at(2) = 1, at(3) = 5/6, and at(4) = 1/3.
  Edge labels in the figure: 1/2, 1/2, 1/3, 1/3, 1/3, 5/12, 5/12.
  The domain 3 gets anti-trust scores from the domains 1 and 2.
  Legend: a seed spam domain; a domain being considered.
Spam Mass
  Fig. 5: An example for explaining Spam Mass, with domains 1 to 7.
  The domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.
  Legend: a seed non-spam domain; a domain being considered.
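A rough sketch of a Spam Mass style filter follows. It assumes the usual formulation in which the relative mass of a domain is (p - p') / p, where p is its PageRank and p' is its PageRank computed with random jumps restricted to the non-spam seeds. That formula, the function name, and the toy scores are assumptions; only the threshold names relativeMass and topPR come from the parameter table of this deck.

```python
import math

def spam_mass_candidates(pagerank, core_pagerank, relative_mass, top_pr):
    """Return domains within the top top_pr percent by PageRank whose
    relative spam mass is at least relative_mass."""
    ranked = sorted(pagerank, key=pagerank.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * top_pr / 100.0))
    candidates = []
    for dom in ranked[:keep]:
        p = pagerank[dom]
        mass = (p - core_pagerank[dom]) / p  # share of score not explained by non-spam seeds
        if mass >= relative_mass:
            candidates.append(dom)
    return candidates

# Toy scores (hypothetical): "b" gets almost none of its score from non-spam seeds.
pagerank = {"a": 0.5, "b": 0.3, "c": 0.2}
core_pagerank = {"a": 0.45, "b": 0.01, "c": 0.19}
print(spam_mass_candidates(pagerank, core_pagerank, relative_mass=0.9, top_pr=100))
```

A domain such as domain 5 in Fig. 5, with many inlinks but little support from non-spam seeds, would have a small p' and hence a relative mass close to 1.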
Link Farm Spam
  Fig. 6: An example for explaining Link Farm Spam, with domains 1 to 5.
  The domains 1, 3, and 4 have bidirectional links.
  Legend: a domain being considered.
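The two thresholds that the parameter table defines for Link Farm Spam (limitBL and limitOL) suggest a two-step procedure: seed the spam set with domains that have enough bidirectional links, then expand it with domains that point to enough spam domains. The sketch below follows that reading. The function name and the toy graph are hypothetical, and the topology is not taken from Fig. 6.

```python
def link_farm_spam(edges, limit_bl, limit_ol):
    """Two-step link farm detection over directed (src, dst) edges."""
    out = {}
    for src, dst in edges:
        out.setdefault(src, set()).add(dst)
        out.setdefault(dst, set())
    # Step 1: seed with domains having at least limit_bl bidirectional links.
    spam = {d for d, targets in out.items()
            if sum(1 for t in targets if t != d and d in out[t]) >= limit_bl}
    # Step 2: add domains with at least limit_ol outlinks into the spam set,
    # repeating until no new domain is added.
    changed = True
    while changed:
        changed = False
        for d, targets in out.items():
            if d not in spam and len(targets & spam) >= limit_ol:
                spam.add(d)
                changed = True
    return spam

# Toy graph: 1, 3, and 4 link to each other in both directions;
# 5 points to two of them and 2 points to one.
toy_edges = [(1, 3), (3, 1), (1, 4), (4, 1), (3, 4), (4, 3), (5, 1), (5, 3), (2, 1)]
print(sorted(link_farm_spam(toy_edges, limit_bl=2, limit_ol=2)))  # prints: [1, 3, 4, 5]
```

Domain 2 is left out because it has only one outlink into the detected set, below the limitOL value of 2 used here.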
Web Spam Filtering Using Seed Refinement
Modifications
Modified TrustRank
  Fig. 7: An example explaining Modified TrustRank, with domains 1 to 6.
  t(i): the trust score of domain i. In the figure, t(1) = 1, t(2) = 1, t(3) = 5/6, t(4) = 1/3, t(5) = 5/12 + …, and t(6) = 5/12 + ….
  The domains 5 and 6 are involved in Web spam.
  Legend: a seed non-spam domain; a seed spam domain; a domain being considered.
Modified Anti-TrustRank
  Fig. 8: An example explaining Modified Anti-TrustRank, with domains 1 to 7.
  at(i): the anti-trust score of domain i. In the figure, at(1) = 1, at(2) = 1, at(3) = 5/6, at(4) = 1/3, at(5) = 5/12, at(6) = 5/12 + …, and at(7) = 5/12 + ….
  The domains 5, 6, and 7 are non-spam domains.
  Legend: a seed spam domain; a seed non-spam domain; a domain being considered.
Modified Spam Mass
  Fig. 9: An example explaining Modified Spam Mass, with domains 1 to 7.
  The domain 5 receives many inlinks but only one indirect inlink from a non-spam domain.
  Legend: a seed non-spam domain; a seed spam domain; a domain being considered.
Modified Link Farm Spam
  Fig. 10: An example explaining Modified Link Farm Spam, with domains 1 to 8.
  The domains 1, 3, and 4 have bidirectional links.
  Legend: a seed non-spam domain; a domain being considered.
Strategy to Make Succession of Modified Algorithms
  Fig. 11: The strategy of succession.
  Manually labeled spam and non-spam domains -> Seed Refiner -> Refined spam and non-spam domains -> Spam Detector -> Detected spam domains.
  Legend: class; data flow.
Performance Evaluation
  Table 1: Summary of the experiments.
  Set 1: Comparisons for showing the effect of refining seed
    Exp. 1: Comparison between TR (TrustRank) and MTR (Modified TrustRank). Parameters: cutoff_Tr = 0% to 300%; Ratio_Top = 10%, 50%, 100%; damp = 0.85.
    Exp. 2: Comparison between ATR (Anti-TrustRank) and MATR (Modified Anti-TrustRank). Parameters: cutoff_ATr = 0% to 300%; Ratio_Top = 10%, 50%, 100%; damp = 0.85.
    Exp. 3: Comparison between SM (Spam Mass) and MSM (Modified Spam Mass). Parameters: relativeMass = 0.7 to 1.0; topPR = 10%, 50%, 100%; damp = 0.85.
    Exp. 4: Comparison between LFS (Link Farm Spam) and MLFS (Modified Link Farm Spam). Parameters: limitBL = 2 to 7; limitOL = 2 to 7.
  Set 2: Comparisons for showing the effect of ordering executions
    Exp. 5: Finding the best succession for the seed refiner. Parameters: cutoff_Tr = 50%, 75%, 100%; cutoff_ATr = 100%; damp = 0.85.
    Exp. 6: Finding the best succession for the spam detector. Parameters: relativeMass = 0.8 to 0.99; topPR = 100%; limitBL = 7; limitOL = 7; damp = 0.85.
    Exp. 7: Comparison among the best succession, the best known algorithm, and the best modified algorithm. Parameters: relativeMass = 0.8 to 0.99; topPR = 100%; limitBL = 7; limitOL = 7; damp = 0.85.
Experimental Parameters
  Table 2: Parameters used in experiments.
  damp: A parameter used in TR, MTR, ATR, and MATR. It is the probability of following an outlink.
  Ratio_Top: The ratio for determining the input seed sets in TR, MTR, ATR, and MATR. Specifically, from the Spam (or Non-Spam) Seed Set, we retrieve the domains whose PageRank scores are larger than or equal to the PageRank score of the top-Ratio_Top% domain among all domains, and then use those domains as the input seed set.
  cutoff_Tr: The cutoff threshold in TR and MTR for declaring the number of non-spam domains. In this thesis, the value of cutoff_Tr is set proportional to the size of the input seed set of non-spam domains.
  cutoff_ATr: The cutoff threshold in ATR and MATR for declaring the number of spam domains. In this thesis, the value of cutoff_ATr is set proportional to the size of the input seed set of spam domains.
  relativeMass: A threshold used in SM and MSM for deciding whether a domain is spam: if the domain receives an excessively higher spam score compared to its non-spam score, the domain is a candidate for Web spam.
  topPR: A threshold used in SM and MSM for selecting spam candidates: a domain is a candidate only if its PageRank score is within the top topPR% of the PageRank scores.
  limitBL: A threshold used in LFS and MLFS: a domain is declared spam if its number of bidirectional links is equal to or greater than this threshold.
  limitOL: A threshold used in LFS and MLFS: a domain is declared spam if the number of its outlinks pointing to spam domains is equal to or greater than this threshold.
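The Ratio_Top rule for choosing the input seed set can be sketched as follows. The reading of "top-Ratio_Top% domain" as the domain at that rank position among all domains is an assumption, as are the function name and the toy scores.

```python
import math

def select_input_seeds(seed_set, pagerank, ratio_top):
    """Keep seed domains whose PageRank is at least the PageRank of the
    domain at the top-ratio_top percent position among all domains."""
    scores = sorted(pagerank.values(), reverse=True)
    k = max(1, math.ceil(len(scores) * ratio_top / 100.0))
    threshold = scores[k - 1]
    return {d for d in seed_set if pagerank[d] >= threshold}

# Toy data (hypothetical): ten domains d0..d9 with scores 10 down to 1.
toy_pagerank = {f"d{i}": 10 - i for i in range(10)}
toy_seeds = {"d0", "d4", "d9"}
for ratio in (10, 50, 100):
    print(ratio, sorted(select_input_seeds(toy_seeds, toy_pagerank, ratio)))
```

With Ratio_Top = 100% every labeled seed is used, while smaller ratios keep only the seeds with the highest PageRank.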
Experimental Data
  Table 3: Characteristics of the data set in terms of domains and web pages.
    Domains, labeled spam: 1,924
    Domains, labeled non-spam: 5,549
    Domains, unlabeled (unknown): 3,929
    Domains, total: 11,402
    Web pages, total: 77.9 million
  Table 4: Classification of the data set as Seed Set and Test Set.
    Labeled spam domains: 674 in the Seed Set, 1,250 in the Test Set
    Labeled non-spam domains: 4,948 in the Seed Set, 601 in the Test Set
Experimental Measure
  Table 5: Description of the measures.
  True positives: The number of domains correctly labeled as belonging to the class (i.e., spam or non-spam). [BCD08]
  False positives: The number of domains incorrectly labeled as belonging to the class (i.e., spam or non-spam). [BCD08]
  F-measure: The combined representation of precision and recall [SM86].
  Footnote 1: False negatives are the number of domains incorrectly labeled as not belonging to the class (i.e., spam or non-spam).
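The slide's formulas are not preserved in this transcript. The sketch below assumes the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-measure = 2 * precision * recall / (precision + recall). The counts in the usage line are made up for illustration and are not results from the thesis.

```python
def f_measure(tp, fp, fn):
    """Return (precision, recall, F-measure) from true positives,
    false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical counts: 80 spam domains found, 20 false alarms, 20 missed.
precision, recall, f = f_measure(tp=80, fp=20, fn=20)
print(precision, recall, f)
```

Because the F-measure is a harmonic mean, it stays low unless both precision and recall are high.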
Comparison between Original and Modified Algorithms (1/3)
Comparison between Original and Modified Algorithms (2/3)
Comparison between Original and Modified Algorithms (3/3)
The Best Succession for the Seed Refiner
  Table 6: Comparison for the seed refiner.
  Columns: True Positives; False Positives. Rows: For Finding Refined Non-Spam Domains; For Finding Refined Spam Domains.
  Three of the four comparisons show identical performance for both successions; the remaining one shows better performance for MATR-MTR compared to MTR-MATR.
  Therefore, MATR-MTR is found to be the winner, and hence we select it as the seed refiner.
The Best Succession for the Spam Detector
  Fig. 12: Comparison for the spam detector.
Comparison among the Best Succession, the Best Known Algorithm and the Best Modified Algorithm
  Fig. 13: Comparison among MATR-MTR-MLFS-MSM, SM, and MSM.
  Therefore, MATR-MTR-MLFS-MSM is more effective.
Conclusions
References (1/2)
References (2/2)
Thank you very much!
Supplement
MTR Algorithm
MATR Algorithm
MSM Algorithm
MLFS Algorithm
TR vs. MTR
  Panels (a) to (f) for Ratio_Top = 10%, 50%, and 100%.
ATR vs. MATR
  Panels (a) to (f) for Ratio_Top = 10%, 50%, and 100%.
SM vs. MSM
  Panels (a) to (f) for topPR = 70%, 85%, and 100%.
LFS vs. MLFS
  Panels (a) and (b).
The Best Succession for the Spam Detector
  Fig x: Comparison for the spam detector.
  The winner is MSM for the Spam Detector.
Comparison among the Best Succession, the Best Known Algorithm and Best Modified Algorithm
  MATR-MTR-MSM is very effective compared to the best known algorithm.
Possible Combinations for Seed Refinement Module
  Succession 1 (MATR-MTR): Manual spam and non-spam seed domains -> MATR -> Manual non-spam domains and refined spam domains -> MTR -> Refined spam and non-spam seed domains.
  Succession 2 (MTR-MATR): Manual spam and non-spam seed domains -> MTR -> Manual spam domains and refined non-spam domains -> MATR -> Refined spam and non-spam seed domains.
  Legend: algorithm; class; data flow.
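The data flow of the two successions can be sketched as a simple chain in which each stage receives the seed sets produced by the previous one. The stage functions below are hypothetical stand-ins that just drop one mislabeled domain each. They do not model how the real, link-based MATR and MTR refine seeds, and all names are assumptions.

```python
def run_succession(stages, spam_seeds, nonspam_seeds):
    """Feed the seed sets through each stage in order; every stage
    returns a (spam_seeds, nonspam_seeds) pair for the next one."""
    for stage in stages:
        spam_seeds, nonspam_seeds = stage(spam_seeds, nonspam_seeds)
    return spam_seeds, nonspam_seeds

def matr_stub(spam, nonspam):
    # Stand-in for MATR: passes non-spam seeds through, returns refined spam seeds.
    return spam - {"s_bad_label"}, nonspam

def mtr_stub(spam, nonspam):
    # Stand-in for MTR: passes spam seeds through, returns refined non-spam seeds.
    return spam, nonspam - {"n_bad_label"}

manual_spam = {"s1", "s2", "s_bad_label"}
manual_nonspam = {"n1", "n_bad_label"}
succession_1 = run_succession([matr_stub, mtr_stub], manual_spam, manual_nonspam)  # MATR-MTR
succession_2 = run_succession([mtr_stub, matr_stub], manual_spam, manual_nonspam)  # MTR-MATR
print(succession_1, succession_2)
```

With these stubs both orders give the same result. The order can only matter for the real algorithms, where the second stage makes use of the seeds refined by the first.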
Possible Combinations for Spam Detection Module
  Combinations: single algorithm (MLFS, MSM); succession (MLFS-MSM, MSM-MLFS).
  Succession 1 (MLFS-MSM): Refined spam/non-spam seed domains -> MLFS -> Spam domains and refined non-spam domains -> MSM -> Detected spam domains.
  Succession 2 (MSM-MLFS): Refined spam/non-spam seed domains -> MSM -> Spam domains and refined non-spam domains -> MLFS -> Detected spam domains.
  Legend: algorithm; class; data flow.
TR and ATR problem
  TR example, with domains 1 to 6: t(i) is the trust score of domain i. In the figure, t(1) = 1, t(2) = 1, t(3) = 5/6, t(4) = 1/3, t(5) = 5/12 + …, and t(6) = 5/12 + …. The domains 5 and 6 are involved in Web spam. Legend: a seed non-spam domain; a domain being considered.
  ATR example, with domains 1 to 7: at(i) is the anti-trust score of domain i. In the figure, at(1) = 1, at(2) = 1, at(3) = 5/6, at(4) = 1/3, at(5) = 5/12, at(6) = 5/12 + …, and at(7) = 5/12 + …. The domains 5, 6, and 7 are non-spam domains. Legend: a seed spam domain; a domain being considered.

Último

Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024

Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement

  • 1. December 16, 2010 Database and Multimedia Lab Korea Advanced Institute of Science and Technology (KAIST) Improving the Quality of Web Spam Filtering by Using Seed Refinement Master Thesis Defense Presenter: Qureshi, Muhammad Atif Advisor: Whang, Kyu-Young
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. Experimental Parameters (Performance Evaluation)
      Table 2: Parameters used in experiments.
      damp: A parameter used in TR, MTR, ATR, and MATR. It is the probability of following an outlink.
      Ratio Top: The ratio that determines the input seed sets in TR, MTR, ATR, and MATR. Specifically, from the Spam (or Non-Spam) Seed Set, we retrieve the domains whose PageRank scores are greater than or equal to the PageRank score of the top-Ratio Top % domain among all domains, and then use those domains as the input seed set.
      cutoff Tr: The cutoff threshold in TR and MTR for declaring the number of non-spam domains. In this thesis, the value of cutoff Tr is set proportional to the size of the input seed set of non-spam domains.
      cutoff ATr: The cutoff threshold in ATR and MATR for declaring the number of spam domains. In this thesis, the value of cutoff ATr is set proportional to the size of the input seed set of spam domains.
      relativeMass: A threshold used in SM and MSM for deciding whether a domain is spam: if the domain receives an excessively higher spam score than non-spam score, it is one of the candidates for Web spam.
      topPR: A threshold used in SM and MSM for deciding whether a domain is a spam candidate: the PageRank score of the domain must be within the top percentage (i.e., topPR %) of the PageRank scores.
      limitBL: A threshold used in LFS and MLFS: a domain is declared spam if its number of bidirectional links is equal to or greater than this threshold.
      limitOL: A threshold used in LFS and MLFS: a domain is declared spam if the number of its outlinks pointing to spam domains is equal to or greater than this threshold.
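The Ratio Top rule above can be illustrated with a small sketch. This is only one plausible reading of the slide, not the thesis implementation: the function name `select_input_seeds`, the dictionary layout, and the use of a ceiling when converting the percentage into a rank position are all assumptions made for illustration.

```python
import math


def select_input_seeds(seed_set, pagerank, ratio_top):
    """Keep the seeds whose PageRank is >= that of the top-ratio_top% domain.

    seed_set  : set of domain names (a manual spam or non-spam seed set)
    pagerank  : dict mapping every domain to its PageRank score
    ratio_top : percentage in (0, 100]
    All names here are hypothetical; the slide gives only the rule in prose.
    """
    if not 0 < ratio_top <= 100:
        raise ValueError("ratio_top must be in (0, 100]")
    # Rank all domains, not only the seeds, by PageRank (highest first).
    ranked = sorted(pagerank.values(), reverse=True)
    # Position of the "top-ratio_top % domain" (rounding up is an assumption).
    k = max(1, math.ceil(len(ranked) * ratio_top / 100))
    threshold = ranked[k - 1]
    return {d for d in seed_set if pagerank[d] >= threshold}


# Toy data, invented purely for the example.
pagerank = {"a.example": 0.40, "b.example": 0.30,
            "c.example": 0.20, "d.example": 0.10}
spam_seeds = {"b.example", "d.example"}
selected = select_input_seeds(spam_seeds, pagerank, 50)
```

With Ratio Top = 100% the threshold falls to the lowest PageRank score, so the whole seed set is used as input.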
  • 24.
  • 25. Experimental Measure (Performance Evaluation)
      Table 5: Description of the measures.
      True positives: The number of domains correctly labeled as belonging to the class (i.e., spam or non-spam). [BCD08]
      False positives: The number of domains incorrectly labeled as belonging to the class (i.e., spam or non-spam). [BCD08]
      F-measure: The combined representation of precision and recall. Precision, recall [SM86], and F-measure are expressed as follows. [equations not recoverable from the slide transcript]
      Footnote 1: False negatives are the number of domains incorrectly labeled as not belonging to the class (i.e., spam or non-spam).
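As a rough illustration of how these measures combine, the sketch below uses the standard textbook definitions of precision, recall, and the balanced F-measure (F1). Whether the thesis uses exactly this weighting is an assumption, since the slide's own equations are not preserved; the function name and the counts are made up for the example.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Uses the standard balanced F-measure; this is an assumption about
    the slide, whose formulas were lost in extraction.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented counts: 8 domains correctly labeled as spam, 2 wrongly labeled
# as spam, and 8 spam domains missed.
p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
```

Here precision is 8/10 and recall is 8/16, which shows why a single combined score is useful: a filter can look precise while still missing half of the spam.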
  • 26.
  • 27.
  • 28.
  • 29. The Best Succession for the Seed Refiner (Performance Evaluation)
      Table 6: Comparison for the seed refiner. Table headings: True Positives, False Positives; For Finding Refined Non-Spam Domains, For Finding Refined Spam Domains.
      Cell annotations: identical performance for both successions (three cells); better performance for MATR-MTR compared to MTR-MATR (one cell).
      Therefore, MATR-MTR is found to be the winner, and hence we select it as the seed refiner.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
  • 35. THANK YOU VERY MUCH!
  • 36. MTR Algorithm (Supplement)
  • 37. MATR Algorithm (Supplement)
  • 38. MSM Algorithm (Supplement)
  • 39. MLFS Algorithm (Supplement)
  • 40. TR vs. MTR (Supplement): figure panels (a)-(f) for Ratio Top = 10%, Ratio Top = 50%, and Ratio Top = 100%.
  • 41. ATR vs. MATR (Supplement): figure panels (a)-(f) for Ratio Top = 10%, Ratio Top = 50%, and Ratio Top = 100%.
  • 42. SM vs. MSM (Supplement): figure panels (a)-(f) for topPR = 70%, topPR = 85%, and topPR = 100%.
  • 43. LFS vs. MLFS (Supplement): figure panels (a) and (b).
  • 44.
  • 45.
  • 46. Possible Combinations for Seed Refinement Module (Supplement)
      Diagram legend: Algorithm, Class, Data flow.
      Succession 1 (MATR-MTR), Seed Refiner: manual spam and non-spam seed domains -> MATR -> manual non-spam domains and refined spam domains -> MTR -> refined spam and non-spam seed domains.
      Succession 2 (MTR-MATR), Seed Refiner: manual spam and non-spam seed domains -> MTR -> manual spam domains and refined non-spam domains -> MATR -> refined spam and non-spam seed domains.
  • 47. Possible Combinations for Spam Detection Module (Supplement)
      Diagram legend: Algorithm, Class, Data flow.
      Combinations: single algorithm (MLFS, MSM) or succession (MLFS-MSM, MSM-MLFS).
      Succession 1 (MLFS-MSM), Spam Detector: refined spam/non-spam seed domains -> MLFS -> spam domains and refined non-spam domains -> MSM -> detected spam domains.
      Succession 2 (MSM-MLFS), Spam Detector: refined spam/non-spam seed domains -> MSM -> spam domains and refined non-spam domains -> MLFS -> detected spam domains.
  • 48. TR and ATR problem (Supplement)
      TR diagram. Legend: a seed non-spam domain; a domain being considered; t(i): the trust score of domain i. Edge labels: 1/2, 1/3, 5/12. Scores: t(1) = 1, t(2) = 1, t(3) = 5/6, t(4) = 1/3, t(5) = 5/12 + …, t(6) = 5/12 + …. The domains 5 and 6 are involved in Web spam.
      ATR diagram. Legend: a seed spam domain; a domain being considered; at(i): the anti-trust score of domain i. Edge labels: 1/2, 1/3, 5/12. Scores: at(1) = 1, at(2) = 1, at(3) = 5/6, at(4) = 1/3, at(5) = 5/12, at(6) = 5/12 + …, at(7) = 5/12 + …. The domains 5, 6, and 7 are non-spam domains.
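The TR scores on this slide are consistent with each domain splitting its trust equally among its outlinks (1/2 + 1/3 = 5/6 for domain 3, then 5/6 split in two gives 5/12). The sketch below reproduces that arithmetic under explicit assumptions: the edge list is guessed from the edge labels, the targets "x" and "y" are hypothetical placeholders for outlinks whose destinations are not recoverable, damping is ignored, and trust is pushed forward in a single pass. It is not the TR algorithm from the thesis.

```python
from fractions import Fraction


def split_trust(outlinks, seeds, order):
    """Push trust forward once, splitting it equally over outlinks.

    outlinks : dict mapping a domain to the list of domains it links to
    seeds    : domains that start with trust 1 (seed non-spam domains)
    order    : domains to score, listed so that sources precede targets
    Simplified illustration only: no damping factor, no iteration.
    """
    trust = {d: Fraction(0) for d in order}
    for s in seeds:
        trust[s] = Fraction(1)
    for d in order:
        targets = outlinks.get(d, [])
        for target in targets:
            # Targets outside `order` still consume their share of trust.
            if target in trust:
                trust[target] += trust[d] / len(targets)
    return trust


# Assumed edges: domain 1 has two outlinks, domain 2 has three, and
# domain 3 links to domains 5 and 6. "x" and "y" are placeholders.
outlinks = {1: [3, "x"], 2: [3, 4, "y"], 3: [5, 6]}
t = split_trust(outlinks, seeds={1, 2}, order=[1, 2, 3, 4, 5, 6])
```

Under these assumptions, domains 5 and 6 each receive 5/12 from domain 3 even though the slide marks them as involved in Web spam, which is the trust leak the diagram points at.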