10. Positive ranking factors (chart values, in slide order: 73%, 71%, 64%, 56%, 51%): keyword-focused anchor text from external links; external link popularity; diversity of link sources; keyword use anywhere in the title tag; trustworthiness of the domain based on link distance from trusted. Negative ranking factors (chart values, in slide order: 68%, 56%, 51%, 51%, 46%): cloaking with malicious intent; link acquisition from known link brokers; links from the page to web spam pages; cloaking by user agent; frequent server downtime and site inaccessibility.
21. Page C has a higher PageRank than Page E even though fewer pages link to it, because the one link it receives is of much higher value. A web surfer who follows a random link on every page (but with 15% likelihood jumps to a random page on the whole web) will be on Page E 8.1% of the time. (The 15% likelihood of jumping to an arbitrary page corresponds to a damping factor of 85%.) Without damping, all web surfers would eventually end up on Pages A, B, or C, and all other pages would have PageRank zero. Page A is assumed to link to all pages in the web, because it has no outgoing links. (Figure: mathematical PageRanks.)
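The random-surfer description above can be sketched as a short power iteration. This is a minimal illustration, not the production algorithm: the function name `pagerank`, the tolerance, and the four-page `toy_links` graph are assumptions made here, and the graph is not the one in the slide's figure. A page with no outgoing links is treated as linking to every page, as the slide describes for Page A.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Return a dict of PageRank values.

    links maps every page to the list of pages it links to; every link
    target must also appear as a key.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links[p])
        # Random jump (1 - damping) plus the dangling share, split evenly.
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for q in links[p]:
                new_rank[q] += damping * rank[p] / len(links[p])
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical graph: A has no outgoing links, B and C link to each other,
# D links to B and A.
toy_links = {"A": [], "B": ["C"], "C": ["B"], "D": ["B", "A"]}
toy_ranks = pagerank(toy_links)
```

With `damping=0.85` the surfer jumps to a random page 15% of the time, which is what keeps every page's rank above zero.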
26. Thus, based on the following features, content-based spam pages can be detected by a Naïve Bayesian classifier, which focuses on the number of times a word is repeated in the content of the page. (Figure 1, Figure 2)
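As a rough illustration of how word repetition can drive such a classifier, here is a minimal multinomial Naive Bayes sketch with Laplace smoothing. The function names `train` and `classify`, the labels, and the tiny training set are all hypothetical; the slides do not specify the features or data actually used.

```python
import math
from collections import Counter


def train(docs):
    """docs is a list of (text, label) pairs.

    Returns per-label word counts and per-label document counts.
    """
    word_counts = {}
    doc_counts = Counter()
    for text, label in docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the highest log-posterior for text."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        # Each occurrence of a word adds its own term, so repetition matters.
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, for illustration only.
training = [
    ("cheap pills cheap pills buy cheap pills now", "spam"),
    ("buy buy buy cheap watches cheap watches", "spam"),
    ("the conference schedule lists three talks on web search", "nonspam"),
    ("our library opens at nine and closes at five", "nonspam"),
]
word_counts, doc_counts = train(training)
```

Calling `classify("cheap cheap cheap pills", word_counts, doc_counts)` scores the repeated word once per occurrence, which is the property the slide points to.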
28. It has been observed that for a normal web page the number of supporters grows exponentially as the distance increases. For a web spam page, by contrast, the number of supporters rises suddenly over a small distance and then falls to zero after some distance. The figure shows the distribution of supporters over distance for spam and non-spam pages. (Figure: distribution of supporters over distance; series: non-spam, spam.)
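One way to obtain such a distribution is a breadth-first search over reversed links. The sketch below assumes that a supporter at distance d is a page whose shortest link path to the target has exactly d hops; the slide does not define the term, and the function name `supporters_by_distance` and the toy graph are hypothetical.

```python
from collections import deque


def supporters_by_distance(links, target, max_distance):
    """Count pages whose shortest link path to target has length 1..max_distance.

    links maps each page to the list of pages it links to.
    """
    # Build the reversed graph: for each page, which pages link to it.
    incoming = {}
    for src, dsts in links.items():
        for dst in dsts:
            incoming.setdefault(dst, set()).add(src)

    dist = {target: 0}
    queue = deque([target])
    counts = [0] * (max_distance + 1)
    while queue:
        page = queue.popleft()
        if dist[page] == max_distance:
            continue  # do not expand beyond the requested distance
        for src in incoming.get(page, ()):
            if src not in dist:
                dist[src] = dist[page] + 1
                counts[dist[src]] += 1
                queue.append(src)
    return counts[1:]


# Hypothetical graph: three pages link straight to "t", and one page reaches
# it through "s1".
toy_graph = {"s1": ["t"], "s2": ["t"], "s3": ["t"], "x": ["s1"], "t": []}
toy_distribution = supporters_by_distance(toy_graph, "t", 3)
```

Plotting the returned list against distance gives a curve of the kind the slide compares for spam and non-spam pages.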
29. System performance: It is important for a search engine to crawl and index efficiently. This way information can be kept up to date and major changes to the system can be tested relatively quickly. In total it took roughly 9 days to download the 26 million pages (including errors). The last 11 million pages were downloaded in just 63 hours, averaging just over 4 million pages per day, or 48.5 pages per second. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.
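The crawl rates quoted above follow directly from the page counts and durations. A short check, using only figures from the slide (the variable names are illustrative, and the overall rate is approximate because the 9 days is stated as rough):

```python
# Final stretch of the crawl: 11 million pages in 63 hours.
last_pages = 11_000_000
last_hours = 63
pages_per_second = last_pages / (last_hours * 3600)
pages_per_day = last_pages / last_hours * 24

# Whole crawl: 26 million pages in roughly 9 days.
total_pages = 26_000_000
total_days = 9
overall_pages_per_second = total_pages / (total_days * 24 * 3600)

print(f"final stretch: {pages_per_second:.1f} pages/s, {pages_per_day:,.0f} pages/day")
print(f"whole crawl:   {overall_pages_per_second:.1f} pages/s")
```

The final stretch works out to about 48.5 pages per second and roughly 4.19 million pages per day, matching the quoted figures, while the rate over the whole crawl is noticeably lower.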