SlideShare a Scribd company logo
1 of 8
Download to read offline
Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com
Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205
www.ijera.com 198 | P a g e
Content based web spam detection using naive bayes with
different feature representation technique
Amit Anand Soni1
, Abhishek Mathur2
Research Scholar SATI Engg. Vidisha (M.P.)
Asst. Professor SATI Engg Vidisha (M.P.)
Abstract
Web Spam Detection is the processing to organize the search result according to specified criteria. Most
often this refers to the automatic processing of search result, but the term also applies to the automatic classification
of search results into ham and spam. Our work also evaluates change in performance by using different
representation for the document vector like term frequency (TF), Binary, inverse document frequency (IDF) and
TF-IDF. There are various Benchmark Datasets available for researchers related to web spam filtering. There has
been significant effort to generate public benchmark datasets for anti- web spam filtering. One of the main
concerns is how to protect the privacy of the users whose ham links are included in the datasets.
We perform a statistical analysis of a large collection of WebPages, focusing on spam detection.
Dimension reduction is important part of classification because it provides ease to visualize high dimensional data.
This work reduce dimension of training data in 2D and full and mapped training and test data in to vector space.
There are several classification here we use Naive Bayes classification and train data set with varying different
representation and testing perform with different spam ham ratio
Key-Words: - Content spam, keyword count, variety, density and Hidden or invisible text
I. INTRODUCTION
Search engines are widely used tools for
effectively exploring information on the Web. One of
the core components of a search engine is its ranking
function: when a search engine receives a user query,
this function determines the order of presentation of
retrieved results (documents or web URLs). The main
goal of the ranking process is to promote high-quality
and relevant content to the top of the result list, which
is an important and challenging problem by itself. In
this work we propose a method for improving the
quality of ranking of search results that addresses the
two important aspects mentioned above through the
temporal analysis of search logs.
First, we identify an interesting link between
email spam and Web spam, and we use this link to
propose a novel technique for extracting large Web
spam samples from the Web. Then, we present the
Webb Spam Corpus – a first-of-its-kind, large-scale,
and publicly available Web spam data set that was
created using our automated Web spam collection
method.
While performing our classifier evaluations,
we identified a clear tension between spam producers
and information consumers. Spam producers are
constantly evolving their technique to ensure their
spam messages are delivered, and information
consumers are constantly evolving their
countermeasures to ensure they don’t receive spam
messages. Based on the results of our evolutionary
study, we began to question the validity of retraining
as a solution for camouflaged messages. Since
spammers continually evolve their techniques, we
believed they would also evolve their camouflaged
messages, making them more sophisticated over time.
This process continues until both parties are firmly
entrenched in a spam arms race. Fortunately, in this
thesis, we propose two solutions that allow
information consumers to break free of this arms race.
The second contribution of this thesis is a
framework for collecting, analyzing, and classifying
examples of Spam attacks in the World Wide Web.
Just as email spam has negatively impacted the user
messaging experience, the rise of Web spam is
threatening to severely degrade the quality of
information on the World Wide Web. Fundamentally,
Web spam is designed to pollute search engines and
corrupt the user experience by driving traffic to
particular spammed Web pages, regardless of the
merits of those pages. Hence, we present various
techniques for automatically identifying and removing
these pages from the Web.
II. RELATEDWORK
In this section, we provide an overview of
previous efforts to improve the ranking of search
results by introducing a better ranking function or a
method to detect and eliminate adversarial content, the
two major research directions, highly relevant to the
present work. The learning-to-rank approaches are
capable of combining different kinds of features to
train the ranking function. A number of previous
RESEARCH ARTICLE OPEN ACCESS
Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com
Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205
www.ijera.com 199 | P a g e
works have also focused on exploring the methods to
obtain useful information from click-through data,
which could benefit search relevance
2.1 Statistical Classification of Email Spam
Email classification can be characterized as
the problem of assigning a boolean value (“spam” or
“legitimate”) to each email message M in a collection
of email messages M. More formally, the task of spam
classification is to approximate the unknown target
function Φ: M! {Spam, legitimate}, which describes
how messages are to be classified, by means of a
function Ǿ: M! {Spam, legitimate} called the classifier
(or model), such that Φ and Ǿ coincide as much as
possible.
Different learning methods have been
explored by the research community for building spam
classifiers (also called spam filters). In our email spam
experiments, we focus on three learning algorithms:
Na¨ıve Bayes, Support Vector Machines (SVM), and
LogitBoost. In the following sections, we will briefly
summarize the important details of each of these
algorithms.
2.1.1 Naive Bayes
Naive Bayes is one of the simplest
classification methods in machine learning. This work
use NB because of it takes less training time and Very
easy to deal with missing attributes. In the experiments
each message is represented as a vector Vi= {T1. .
.Tm}( Vi is a feature vector of document i) where T1. .
.Tm are the feature and Wi1, Wi2.....Wim are the
weight of term T1. . .Tm. We are doing spam filtering
in which we have only two classes.
Given a classification task of 2 classes C1,
C2 and an unknown pattern, which is represented by a
feature vector V, form the two conditional
probabilities for i=1, 2 Sometimes, these are
also referred to as a posteriori probabilities. In words,
each of them represents the probability that the
unknown pattern belongs to the respective class Ci.
Let C1 (spam), C2 (ham) be the two classes in which
message belong. Assume that the a priori probabilities
P (C1), P (C2) are known. If P (C1), P (C2) are
unknown than easily calculated from training dataset.
If N total number of mails (spam ham) in training
dataset in which N1 belongs to C1 (spam) class and
N2 belongs to C2 (ham) class then
Now compute conditional probability.
Where p (V) is the pdf of V
The Bayes classification rule can now be stated as
If > , V is classified to
If < , V is classified to
In case of both are equal then we assign vector X in
either class.
Here we don’t consider p (V), because it is same for
all classes. If the a priori probabilities are equal
Than
2.1.2 Dimension reduction:
DR is important part of classification because
it provides ease to visualize high dimensional data.
Singular Value Decomposition (SVD):
Data set representation in the form of term document
matrix that represents n number of document and m
number of term that describe every document.
Suppose A is a document term matrix of nxm matrix
of data set A, Aij shows the feature j for documents i.
Every row of A represented by document (vector of
term with m dimension) and number of column called
dimension of vector.
Mathematical decomposition of matrix:
Mathematically matrix A of nxm is
decomposing into three parts. Decomposition of
matrix is given below.
Here,
d: Represent number of document.
t: Represent number of term in document vector.
A [d x t] = U [d x t] *S [t x t] *(V [t x t]) T
Decomposition of matrix using SVD
Preprocessing of Dataset:
The data set is subjected to the pre-
processing. The dataset contains two labeled files
which show that the link is spam or normal. From
these files constructed our data. Link belongs to which
category known to us so it can be easily separable.
Wrote a program to extract the content of the pages
and save the result into a corresponding text files.
Generate a sparse matrix which contains the
observation and features. Observations are rows and
features are columns.
Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com
Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205
www.ijera.com 200 | P a g e
Table Train Dataset
Datase
ts Training Spa
m:
Ham
ratio
Tot
al
Spa
m
Ha
m
Datase
t1
449
6
436
1 1:1
885
7
Table Test Dataset
Datasets Testing Spam:
Ham
ratio
Total
Spam Ham
Dataset1 4500 4500
1:1
9000
Dataset2 3675 1500
2:1
5175
Dataset3 4500 1500
3:1
6000
Feature Representation:
A feature is a word that present in document. Any
word in document is called feature if it is satisfies
some predefine constraint (feature selection method),
Term actually a word refers by T; V is a feature vector
that is composed of the various term formed by
analyzing the documents. Every webpage represent by
vector. There is various ways to represent vector
weight (value of each feature in a vector), vector
weight refer by W
Some of them given below:
Term Frequency (TF): Term frequency tfi j is the
number of occurrences of term tj in document Di
Note: Different author and research paper used
different definition of TF some of given below
f (tfi j) = tfi j
f (tfi j) = tfi j / l(Di)
Where l(Di) is the length of document Di ,means total
number of term occurrences in document Di
f (tfi j) =√ tfi j
f (tfi j) =1+log (tfi j)
We can say that tern frequency refers as a local and I
am using TF using
f (tfi j) = tfi j
Binary: Binary representation which indicates
whether a particular term tj occurs in a particular
document or not. In this representation weight of term
tj define as
Wij=1 if
Otherwise Wij=0
Document Frequency (DF): Document Frequency dfj
is the number of documents in the collection (Di
where 1≤i≤n) that term Tj occurs in. Document
Frequency refers as global. In DF we consider only
term occurs or not ignore whatever value of Wij hold.
Inverse Document Frequency (IDF):
Inverse Document Frequency idfj calculate as
follow
idfj=log
(N/dfj)
N: Total number of document
Term frequency–Inverse document
frequency (TF-IDF): Term frequency
multiply by inverse document frequency is
called TF-IDF.
(tf-idf)i j=
tfi j* idfj
III. Performance Measure
Confusion Matrix for Spam and Ham class
 True positive (TP): Correct classifications, spam
documents (positive class) classified as spam
(positive class)
 True negative (TN): Correct classifications, ham
documents (negative class) classified as ham
(negative class)
 False positive (FP): Incorrect classification, FP
occurs when the outcome is incorrectly predicted
as spam (or positive) when it is actually ham
(negative).
 False negative (FN): Incorrect classification, FN
occurs when the outcome is incorrectly predicted
as ham (or negative) when it is actually spam
(positive).
 Accuracy (AC): accuracy is ratio of correct
classification and total number of predictions
Precision:
Precision for a class is the ratio of true class (same
class in actual belong to same class in prediction) and
total number of item belong for that class in
prediction. In other word we can say precision is
accuracy of our classification for this class.
Recall:
Recall for a class is the ratio of true class (same class
in actual belong to same class in prediction) and total
number of item belong for this class in actual. In other
word recall is completeness our classification for this
class.
predicted class
ham (-1) spam (+1)
Actual
Class
ham (-1) TN FP
Spam (+1) FN TP
Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com
Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205
www.ijera.com 201 | P a g e
False alarm rate:
False alarm rate is define as
Or
FAR=1- Recall for
ham documents
Ex:
TN:-150, FP:-34, FN:-45, TP:-120
Total ham documents =150+34=184
Total spam documents =45+120=165
Ham documents predicted=150+45=195
Spam documents predicted=120+34=154
IV. Experimental Results
To determine our filter’s performance when it
is trained with the various training sets, we evaluate
the filter’s false positive and false negative rates.
4.1
0
20
40
60
80
100
120
Binary
representation
Term Frequency Inverse Document
Frequency
TF-IDF
Test 1-1
Test 2-1
Test 3-1
predicted class
ham (-1) spam (+1)
ActualClass
ham (-1) 150 34
Spam (+1) 45 120
Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com
Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205
www.ijera.com 202 | P a g e
Spam-precision
4.2
Spam-Recall
4.3
0
20
40
60
80
100
120
Binary
representation
Term Frequency Inverse Document
Frequency
TF-IDF
Test 1-1
Test 2-1
Test 3-1
0
0.2
0.4
0.6
0.8
1
1.2
Binary
representation
Term Frequency Inverse Document
Frequency
TF-IDF
Test 1-1
Test 2-1
Test 3-1
Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com
Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205
www.ijera.com 203 | P a g e
FAR(false alarm rate)
4.4
Accuracy
V. Results and Discussion
Result with Binary representation
Train Factor
Test 1-1 Test 2-1 Test 3-1
Spam Ham Spam Ham Spam Ham
Train1-1
Pre/rec FAR ACC Pre/rec Pre/rec FAR ACC Pre/rec Pre/rec FAR ACC Pre/rec
2 62.35/93.04 0.562 68.43 86.3/43.82 86.05/90.15 0.358 82.63 72.68/64.2 78.48/93.04 0.765 75.65 52.93/23.47
Full
54.32/90.04 0.757 57.16 70.91/24.27 74.29/79.1 0.671 65.72 39.14/32.93 75.02/90.04 0.899 70.05 25.21/10.07
Result with Inverse Document Frequency
Trai
n
Facto
r
Test 1-1 Test 2-1 Test 3-1
Spam Ham Spam Ham Spam Ham
Train1-1
Pre/rec FAR AC
C
Pre/rec Pre/rec FA
R
AC
C
Pre/rec Pre/rec FA
R
AC
C
Pre/rec
2
87.7/31.0
7
0.043
6
63.3
6
58.12/95.
64
94.83/26.
97
0.03
6
47.0
9
35.01/96
.4
98.73/31.
07
0.01
2 48
32.33/98.
8
Full
55.09/87.
13 0.71
58.0
4
69.23/28.
96
75.72/77.
63 0.61
66.4
3 41.58/39
74.53/87.
13
0.89
3
68.0
2
21.65/10.
67
Result with Term Frequency
Trai
n
Facto
r
Test 1-1 Test 2-1 Test 3-1
Spam Ham Spam Ham Spam Ham
Tr
ai
n
1-
1
Pre/rec FA
R
AC
C
Pre/rec Pre/rec FA
R
AC
C
Pre/rec Pre/rec FA
R
AC
C
Pre/rec
0
10
20
30
40
50
60
70
80
90
Binary
representation
Term Frequency Inverse Document
Frequency
TF-IDF
Test 1-1
Test 2-1
Test 3-1
Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com
Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205
www.ijera.com 204 | P a g e
2
55.92/96.
91
0.76
4
60.2
6 88.43/23.
6
79.81/96.
35
0.59
7
80.1 81.84/40.
27
75.55/96.
91
0.94
1
74.1
7
39.04/5.9
33
Full
56.57/88.
49
0.67
9
60.2
8
73.58/32.
07
76.41/80.
27
0.60
7
68.3
9
44.82/39.
27
76.98/88.
49
0.79
4
71.5
2 37.36/20.
6
Result with TF-IDF
Trai
n
Facto
r
Test 1-1 Test 2-1 Test 3-1
Spam Ham Spam Ham Spam Ham
Train1-1
Pre/rec FA
R
AC
C
Pre/rec Pre/rec FA
R
AC
C
Pre/rec Pre/rec FA
R
AC
C
Pre/rec
2
52.14/96.
44
0.88
5
53.9
7
76.37/11.
49
76/96.49 0.74
7
75.8
6
74.66/25.
33
74.62/96.
44
0.98
4
72.7
3
13.04/1.6
Full
56.46/88.
91
0.68
6
60.1
7
73.92/31.
42
76.09/80
.6
0.62
1
68.2
3
44.38/37.
93
77.31/88.
91
0.78
3
72.1
2 39.52/21.
73
VI. Conclusion
 In Binary representation test data set test 2:1
perform well in terms of recall precision and
false alarm rate
 IDF representation gives highest false alarm rate
and precision in all testing datasets.
 Data set test 1:1 give less precision in compare to
test 2:1 and test 3:1 data set.
 Dimension reduction of training and test data set
in to 2D and full 2D perform well as compare to
full Dimension.
SUMMARY
The creation of the Internet has
fundamentally changed the way we communicate,
conduct business, and interact with the world around
us. The World Wide Web, and social networking
communities, which provide information consumers
with an unprecedented amount of freely available
information. However, the openness of these
environments has also made them vulnerable to a new
class of attacks called Spam attacks. Attackers launch
these attacks by deliberately inserting low quality
information into information-rich environments to
promote that information or to deny access to high
quality information. These attacks directly threaten the
usefulness and dependability of online information-
rich environments, and as a result, an important
research question is how to automatically identify and
remove this low quality information from these
environments. In this research paper, we focus on
answering this important question by countering Spam
attacks in three of the most important information-rich
environments: email systems, the World Wide Web,
and social networking communities. For each
environment, we perform large-scale data collection
and analysis operations to create massive corpora of
low and high quality information. Then, we use our
collections to identify characteristics that uniquely
distinguish examples of low and high quality
information. Finally, we use our characterizations to
create techniques that automatically detect and remove
low quality information from online information-rich
environments.
References
[1] Alexa, “Alexa top 500 sites.”
http://www.alexa.com/site/ds/top_sites?ts_mo
de=global, 2008.
[2] Anderson, D. S. and others, “Spamscatter:
Characterizing Internet Scam Hosting
Infrastructure,” in Proceedings of 16th
Usenix Security Symposium (Security ’07),
2007
[3] Acquisti, A. and Gross, R., “Imagined
communities: Awareness, information
sharing, and privacy on the facebook,” in
Proceedings of the 6th Workshop on Privacy
Enhancing Technologies (PET ’06), pp. 36 –
58, 2006.
[4] Associated Press, “Official sues students over
myspace page.” http: //www.sfgate.com/cgi-
bin/article.cgi?file=/news/archive/2006/09/22
% /national/a092749D95.DTL, 2006
[5] Androutsopoulos, I., Paliouras, G., and
Michelakis, E., “Learning to Filter
Unsolicited Commercial E-mail,” Tech. Rep.
2004/2, National Center for Scientific
Research “Demokritos”, 2004.
[6] Amitay, E. and others, “The Connectivity
Sonar: Detecting Site Functionality by
Structural Patterns,” in Proceedings of the
Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com
Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205
www.ijera.com 205 | P a g e
14th ACM Conference on Hypertext and
Hypermedia (HYPERTEXT ’03), pp. 38–47,
2003.
[7] Ahamad, M. and others, “Guarding the Next
Internet Frontier: Countering Denial of
Information Attacks,” in Proceedings of the
New Security Paradigms Workshop (NSPW
’02), 2002..
[8] Androutsopoulos, I. and others, “An
Evaluation of Naive Bayesian Anti-Spam
Filtering,” in Proceedings of the Workshop
on Machine Learning in the New Information
Age, 11th European Conference on Machine
Learning, pp. 9–17, 2000.
[9] Androutsopoulos, I. and others, “An
Experimental Comparison of Naïve Bayesian
and Keyword-based Anti-spam Filtering with
Encrypted Personal E-mail Messages,” in
Proceedings of the 23rd Annual International
ACM SIGIR Conference on Research and
Development in Information Retrieval, pp.
160–167, 2000.
[10] Androutsopoulos, I. and others, “Learning to
Filter Spam E-mail: A Comparison of a
Naive Bayesian and a Memory-based
Approach,” in Proceedings of the Workshop
on Machine Learning and Textual
Information Access, 4th European
Conference on Principles and Practice of
Knowledge Discovery in Databases, pp. 1–
13, 2000.
[11] Apte, C., Damerau, F., and Weiss, S. M.,
“Automated Learning of Decision Rules for
Text Categorization,” Information Systems,
vol. 12, no. 3, pp. 233–251, 1994.

More Related Content

What's hot

An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...Computer Science Journals
 
Study, analysis and formulation of a new method for integrity protection of d...
Study, analysis and formulation of a new method for integrity protection of d...Study, analysis and formulation of a new method for integrity protection of d...
Study, analysis and formulation of a new method for integrity protection of d...ijsrd.com
 
COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...
COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...
COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...IJCSEA Journal
 
A multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectionA multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectioneSAT Publishing House
 
A multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectionA multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectioneSAT Journals
 
IRJET- Suspicious Email Detection System
IRJET- Suspicious Email Detection SystemIRJET- Suspicious Email Detection System
IRJET- Suspicious Email Detection SystemIRJET Journal
 
Social media recommendation based on people and tags (final)
Social media recommendation based on people and tags (final)Social media recommendation based on people and tags (final)
Social media recommendation based on people and tags (final)es712
 
Fake News Detection using Machine Learning
Fake News Detection using Machine LearningFake News Detection using Machine Learning
Fake News Detection using Machine Learningijtsrd
 
Semantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningSemantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningEditor IJCATR
 
News Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic AnalysisNews Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic AnalysisTELKOMNIKA JOURNAL
 
Ijarcet vol-2-issue-7-2252-2257
Ijarcet vol-2-issue-7-2252-2257Ijarcet vol-2-issue-7-2252-2257
Ijarcet vol-2-issue-7-2252-2257Editor IJARCET
 
Context Sensitive Relatedness Measure of Word Pairs
Context Sensitive Relatedness Measure of Word PairsContext Sensitive Relatedness Measure of Word Pairs
Context Sensitive Relatedness Measure of Word PairsIJCSIS Research Publications
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCSCJournals
 

What's hot (16)

An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
 
Study, analysis and formulation of a new method for integrity protection of d...
Study, analysis and formulation of a new method for integrity protection of d...Study, analysis and formulation of a new method for integrity protection of d...
Study, analysis and formulation of a new method for integrity protection of d...
 
COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...
COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...
COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...
 
A multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectionA multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detection
 
A multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detectionA multi classifier prediction model for phishing detection
A multi classifier prediction model for phishing detection
 
IRJET- Suspicious Email Detection System
IRJET- Suspicious Email Detection SystemIRJET- Suspicious Email Detection System
IRJET- Suspicious Email Detection System
 
Topic modeling
Topic modelingTopic modeling
Topic modeling
 
Social media recommendation based on people and tags (final)
Social media recommendation based on people and tags (final)Social media recommendation based on people and tags (final)
Social media recommendation based on people and tags (final)
 
Fake News Detection using Machine Learning
Fake News Detection using Machine LearningFake News Detection using Machine Learning
Fake News Detection using Machine Learning
 
Final ppt
Final pptFinal ppt
Final ppt
 
Semantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data MiningSemantically Enriched Knowledge Extraction With Data Mining
Semantically Enriched Knowledge Extraction With Data Mining
 
Resume parser
Resume parserResume parser
Resume parser
 
News Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic AnalysisNews Reliability Evaluation using Latent Semantic Analysis
News Reliability Evaluation using Latent Semantic Analysis
 
Ijarcet vol-2-issue-7-2252-2257
Ijarcet vol-2-issue-7-2252-2257Ijarcet vol-2-issue-7-2252-2257
Ijarcet vol-2-issue-7-2252-2257
 
Context Sensitive Relatedness Measure of Word Pairs
Context Sensitive Relatedness Measure of Word PairsContext Sensitive Relatedness Measure of Word Pairs
Context Sensitive Relatedness Measure of Word Pairs
 
Cluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector MachineCluster Based Web Search Using Support Vector Machine
Cluster Based Web Search Using Support Vector Machine
 

Viewers also liked (20)

Ae35171177
Ae35171177Ae35171177
Ae35171177
 
Bt35408413
Bt35408413Bt35408413
Bt35408413
 
Bj35343348
Bj35343348Bj35343348
Bj35343348
 
Bf35324328
Bf35324328Bf35324328
Bf35324328
 
Bv35420424
Bv35420424Bv35420424
Bv35420424
 
Be35317323
Be35317323Be35317323
Be35317323
 
Bi35339342
Bi35339342Bi35339342
Bi35339342
 
Ai35194197
Ai35194197Ai35194197
Ai35194197
 
Cv35547551
Cv35547551Cv35547551
Cv35547551
 
Bs35401407
Bs35401407Bs35401407
Bs35401407
 
Bq35386393
Bq35386393Bq35386393
Bq35386393
 
Bw35425429
Bw35425429Bw35425429
Bw35425429
 
Bm35359363
Bm35359363Bm35359363
Bm35359363
 
Bc35306309
Bc35306309Bc35306309
Bc35306309
 
Bd35310316
Bd35310316Bd35310316
Bd35310316
 
Bn35364376
Bn35364376Bn35364376
Bn35364376
 
Bk35349352
Bk35349352Bk35349352
Bk35349352
 
Af35178182
Af35178182Af35178182
Af35178182
 
Bu35414419
Bu35414419Bu35414419
Bu35414419
 
Bg35329332
Bg35329332Bg35329332
Bg35329332
 

Similar to Aj35198205

Survey in Online Social Media Skelton by Network based Spam
Survey in Online Social Media Skelton by Network based SpamSurvey in Online Social Media Skelton by Network based Spam
Survey in Online Social Media Skelton by Network based SpamIRJET Journal
 
Sentiment Analysis: A comparative study of Deep Learning and Machine Learning
Sentiment Analysis: A comparative study of Deep Learning and Machine LearningSentiment Analysis: A comparative study of Deep Learning and Machine Learning
Sentiment Analysis: A comparative study of Deep Learning and Machine LearningIRJET Journal
 
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET Journal
 
NEr using N-Gram techniqueppt
NEr using N-Gram techniquepptNEr using N-Gram techniqueppt
NEr using N-Gram techniquepptGyandeep Kansal
 
Netspam: An Efficient Approach to Prevent Spam Messages using Support Vector ...
Netspam: An Efficient Approach to Prevent Spam Messages using Support Vector ...Netspam: An Efficient Approach to Prevent Spam Messages using Support Vector ...
Netspam: An Efficient Approach to Prevent Spam Messages using Support Vector ...ijtsrd
 
IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET Journal
 
A Survey on Spam Filtering Methods and Mapreduce with SVM
A Survey on Spam Filtering Methods and Mapreduce with SVMA Survey on Spam Filtering Methods and Mapreduce with SVM
A Survey on Spam Filtering Methods and Mapreduce with SVMIRJET Journal
 
Indonesian language email spam detection using N-gram and Naïve Bayes algorithm
Indonesian language email spam detection using N-gram and Naïve Bayes algorithmIndonesian language email spam detection using N-gram and Naïve Bayes algorithm
Indonesian language email spam detection using N-gram and Naïve Bayes algorithmjournalBEEI
 
IRJET - Fake News Detection using Machine Learning
IRJET -  	  Fake News Detection using Machine LearningIRJET -  	  Fake News Detection using Machine Learning
IRJET - Fake News Detection using Machine LearningIRJET Journal
 
IRJET- Detection of Ranking Fraud in Mobile Applications
IRJET-  	  Detection of Ranking Fraud in Mobile ApplicationsIRJET-  	  Detection of Ranking Fraud in Mobile Applications
IRJET- Detection of Ranking Fraud in Mobile ApplicationsIRJET Journal
 
A Heuristic Approach for Network Data Clustering
A Heuristic Approach for Network Data ClusteringA Heuristic Approach for Network Data Clustering
A Heuristic Approach for Network Data Clusteringidescitation
 
COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...
COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...
COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...IJCSEA Journal
 
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...IRJET Journal
 
11.hybrid ga svm for efficient feature selection in e-mail classification
11.hybrid ga svm for efficient feature selection in e-mail classification11.hybrid ga svm for efficient feature selection in e-mail classification
11.hybrid ga svm for efficient feature selection in e-mail classificationAlexander Decker
 
Hybrid ga svm for efficient feature selection in e-mail classification
Hybrid ga svm for efficient feature selection in e-mail classificationHybrid ga svm for efficient feature selection in e-mail classification
Hybrid ga svm for efficient feature selection in e-mail classificationAlexander Decker
 
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM cscpconf
 
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM csandit
 
Bug Triage: An Automated Process
Bug Triage: An Automated ProcessBug Triage: An Automated Process
Bug Triage: An Automated ProcessIRJET Journal
 
Integration of feature sets with machine learning techniques
Integration of feature sets with machine learning techniquesIntegration of feature sets with machine learning techniques
Integration of feature sets with machine learning techniquesiaemedu
 

Similar to Aj35198205 (20)

Survey in Online Social Media Skelton by Network based Spam
Survey in Online Social Media Skelton by Network based SpamSurvey in Online Social Media Skelton by Network based Spam
Survey in Online Social Media Skelton by Network based Spam
 
Sentiment Analysis: A comparative study of Deep Learning and Machine Learning
Sentiment Analysis: A comparative study of Deep Learning and Machine LearningSentiment Analysis: A comparative study of Deep Learning and Machine Learning
Sentiment Analysis: A comparative study of Deep Learning and Machine Learning
 
Ijetcas14 446
Ijetcas14 446Ijetcas14 446
Ijetcas14 446
 
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning AlgorithmsIRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
IRJET- Sentimental Analysis for Online Reviews using Machine Learning Algorithms
 
NEr using N-Gram techniqueppt
NEr using N-Gram techniquepptNEr using N-Gram techniqueppt
NEr using N-Gram techniqueppt
 
Netspam: An Efficient Approach to Prevent Spam Messages using Support Vector ...
Netspam: An Efficient Approach to Prevent Spam Messages using Support Vector ...Netspam: An Efficient Approach to Prevent Spam Messages using Support Vector ...
Netspam: An Efficient Approach to Prevent Spam Messages using Support Vector ...
 
IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text Rank
 
A Survey on Spam Filtering Methods and Mapreduce with SVM
A Survey on Spam Filtering Methods and Mapreduce with SVMA Survey on Spam Filtering Methods and Mapreduce with SVM
A Survey on Spam Filtering Methods and Mapreduce with SVM
 
Indonesian language email spam detection using N-gram and Naïve Bayes algorithm
Indonesian language email spam detection using N-gram and Naïve Bayes algorithmIndonesian language email spam detection using N-gram and Naïve Bayes algorithm
Indonesian language email spam detection using N-gram and Naïve Bayes algorithm
 
IRJET - Fake News Detection using Machine Learning
IRJET -  	  Fake News Detection using Machine LearningIRJET -  	  Fake News Detection using Machine Learning
IRJET - Fake News Detection using Machine Learning
 
IRJET- Detection of Ranking Fraud in Mobile Applications
IRJET-  	  Detection of Ranking Fraud in Mobile ApplicationsIRJET-  	  Detection of Ranking Fraud in Mobile Applications
IRJET- Detection of Ranking Fraud in Mobile Applications
 
A Heuristic Approach for Network Data Clustering
A Heuristic Approach for Network Data ClusteringA Heuristic Approach for Network Data Clustering
A Heuristic Approach for Network Data Clustering
 
COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...
COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...
COMPARISON OF COLLABORATIVE FILTERING ALGORITHMS WITH VARIOUS SIMILARITY MEAS...
 
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...An Approach for Malicious Spam Detection in Email with Comparison of Differen...
An Approach for Malicious Spam Detection in Email with Comparison of Differen...
 
11.hybrid ga svm for efficient feature selection in e-mail classification
11.hybrid ga svm for efficient feature selection in e-mail classification11.hybrid ga svm for efficient feature selection in e-mail classification
11.hybrid ga svm for efficient feature selection in e-mail classification
 
Hybrid ga svm for efficient feature selection in e-mail classification
Hybrid ga svm for efficient feature selection in e-mail classificationHybrid ga svm for efficient feature selection in e-mail classification
Hybrid ga svm for efficient feature selection in e-mail classification
 
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM
 
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM
FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM
 
Bug Triage: An Automated Process
Bug Triage: An Automated ProcessBug Triage: An Automated Process
Bug Triage: An Automated Process
 
Integration of feature sets with machine learning techniques
Integration of feature sets with machine learning techniquesIntegration of feature sets with machine learning techniques
Integration of feature sets with machine learning techniques
 

Recently uploaded

Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceMartin Humpolec
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum ComputingGDSC PJATK
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Spring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdfSpring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdfAnna Loughnan Colquhoun
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdfJamie (Taka) Wang
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxYounusS2
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 

Recently uploaded (20)

Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your Salesforce
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum Computing
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Spring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdfSpring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdf
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptx
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 

Aj35198205

  • 1. Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205 www.ijera.com 198 | P a g e Content based web spam detection using naive bayes with different feature representation technique Amit Anand Soni1 , Abhishek Mathur2 Research Scholar SATI Engg. Vidisha (M.P.) Asst. Professor SATI Engg Vidisha (M.P.) Abstract Web Spam Detection is the processing to organize the search result according to specified criteria. Most often this refers to the automatic processing of search result, but the term also applies to the automatic classification of search results into ham and spam. Our work also evaluates change in performance by using different representation for the document vector like term frequency (TF), Binary, inverse document frequency (IDF) and TF-IDF. There are various Benchmark Datasets available for researchers related to web spam filtering. There has been significant effort to generate public benchmark datasets for anti- web spam filtering. One of the main concerns is how to protect the privacy of the users whose ham links are included in the datasets. We perform a statistical analysis of a large collection of WebPages, focusing on spam detection. Dimension reduction is important part of classification because it provides ease to visualize high dimensional data. This work reduce dimension of training data in 2D and full and mapped training and test data in to vector space. There are several classification here we use Naive Bayes classification and train data set with varying different representation and testing perform with different spam ham ratio Key-Words: - Content spam, keyword count, variety, density and Hidden or invisible text I. INTRODUCTION Search engines are widely used tools for effectively exploring information on the Web. One of the core components of a search engine is its ranking function: when a search engine receives a user query, this function determines the order of presentation of retrieved results (documents or web URLs). The main goal of the ranking process is to promote high-quality and relevant content to the top of the result list, which is an important and challenging problem by itself. In this work we propose a method for improving the quality of ranking of search results that addresses the two important aspects mentioned above through the temporal analysis of search logs. First, we identify an interesting link between email spam and Web spam, and we use this link to propose a novel technique for extracting large Web spam samples from the Web. Then, we present the Webb Spam Corpus – a first-of-its-kind, large-scale, and publicly available Web spam data set that was created using our automated Web spam collection method. While performing our classifier evaluations, we identified a clear tension between spam producers and information consumers. Spam producers are constantly evolving their technique to ensure their spam messages are delivered, and information consumers are constantly evolving their countermeasures to ensure they don’t receive spam messages. Based on the results of our evolutionary study, we began to question the validity of retraining as a solution for camouflaged messages. Since spammers continually evolve their techniques, we believed they would also evolve their camouflaged messages, making them more sophisticated over time. This process continues until both parties are firmly entrenched in a spam arms race. Fortunately, in this thesis, we propose two solutions that allow information consumers to break free of this arms race. The second contribution of this thesis is a framework for collecting, analyzing, and classifying examples of Spam attacks in the World Wide Web. Just as email spam has negatively impacted the user messaging experience, the rise of Web spam is threatening to severely degrade the quality of information on the World Wide Web. Fundamentally, Web spam is designed to pollute search engines and corrupt the user experience by driving traffic to particular spammed Web pages, regardless of the merits of those pages. Hence, we present various techniques for automatically identifying and removing these pages from the Web. II. RELATEDWORK In this section, we provide an overview of previous efforts to improve the ranking of search results by introducing a better ranking function or a method to detect and eliminate adversarial content, the two major research directions, highly relevant to the present work. The learning-to-rank approaches are capable of combining different kinds of features to train the ranking function. A number of previous RESEARCH ARTICLE OPEN ACCESS
  • 2. Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205 www.ijera.com 199 | P a g e works have also focused on exploring the methods to obtain useful information from click-through data, which could benefit search relevance 2.1 Statistical Classification of Email Spam Email classification can be characterized as the problem of assigning a boolean value (“spam” or “legitimate”) to each email message M in a collection of email messages M. More formally, the task of spam classification is to approximate the unknown target function Φ: M! {Spam, legitimate}, which describes how messages are to be classified, by means of a function Ǿ: M! {Spam, legitimate} called the classifier (or model), such that Φ and Ǿ coincide as much as possible. Different learning methods have been explored by the research community for building spam classifiers (also called spam filters). In our email spam experiments, we focus on three learning algorithms: Na¨ıve Bayes, Support Vector Machines (SVM), and LogitBoost. In the following sections, we will briefly summarize the important details of each of these algorithms. 2.1.1 Naive Bayes Naive Bayes is one of the simplest classification methods in machine learning. This work use NB because of it takes less training time and Very easy to deal with missing attributes. In the experiments each message is represented as a vector Vi= {T1. . .Tm}( Vi is a feature vector of document i) where T1. . .Tm are the feature and Wi1, Wi2.....Wim are the weight of term T1. . .Tm. We are doing spam filtering in which we have only two classes. Given a classification task of 2 classes C1, C2 and an unknown pattern, which is represented by a feature vector V, form the two conditional probabilities for i=1, 2 Sometimes, these are also referred to as a posteriori probabilities. In words, each of them represents the probability that the unknown pattern belongs to the respective class Ci. Let C1 (spam), C2 (ham) be the two classes in which message belong. Assume that the a priori probabilities P (C1), P (C2) are known. If P (C1), P (C2) are unknown than easily calculated from training dataset. If N total number of mails (spam ham) in training dataset in which N1 belongs to C1 (spam) class and N2 belongs to C2 (ham) class then Now compute conditional probability. Where p (V) is the pdf of V The Bayes classification rule can now be stated as If > , V is classified to If < , V is classified to In case of both are equal then we assign vector X in either class. Here we don’t consider p (V), because it is same for all classes. If the a priori probabilities are equal Than 2.1.2 Dimension reduction: DR is important part of classification because it provides ease to visualize high dimensional data. Singular Value Decomposition (SVD): Data set representation in the form of term document matrix that represents n number of document and m number of term that describe every document. Suppose A is a document term matrix of nxm matrix of data set A, Aij shows the feature j for documents i. Every row of A represented by document (vector of term with m dimension) and number of column called dimension of vector. Mathematical decomposition of matrix: Mathematically matrix A of nxm is decomposing into three parts. Decomposition of matrix is given below. Here, d: Represent number of document. t: Represent number of term in document vector. A [d x t] = U [d x t] *S [t x t] *(V [t x t]) T Decomposition of matrix using SVD Preprocessing of Dataset: The data set is subjected to the pre- processing. The dataset contains two labeled files which show that the link is spam or normal. From these files constructed our data. Link belongs to which category known to us so it can be easily separable. Wrote a program to extract the content of the pages and save the result into a corresponding text files. Generate a sparse matrix which contains the observation and features. Observations are rows and features are columns.
  • 3. Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205 www.ijera.com 200 | P a g e Table Train Dataset Datase ts Training Spa m: Ham ratio Tot al Spa m Ha m Datase t1 449 6 436 1 1:1 885 7 Table Test Dataset Datasets Testing Spam: Ham ratio Total Spam Ham Dataset1 4500 4500 1:1 9000 Dataset2 3675 1500 2:1 5175 Dataset3 4500 1500 3:1 6000 Feature Representation: A feature is a word that present in document. Any word in document is called feature if it is satisfies some predefine constraint (feature selection method), Term actually a word refers by T; V is a feature vector that is composed of the various term formed by analyzing the documents. Every webpage represent by vector. There is various ways to represent vector weight (value of each feature in a vector), vector weight refer by W Some of them given below: Term Frequency (TF): Term frequency tfi j is the number of occurrences of term tj in document Di Note: Different author and research paper used different definition of TF some of given below f (tfi j) = tfi j f (tfi j) = tfi j / l(Di) Where l(Di) is the length of document Di ,means total number of term occurrences in document Di f (tfi j) =√ tfi j f (tfi j) =1+log (tfi j) We can say that tern frequency refers as a local and I am using TF using f (tfi j) = tfi j Binary: Binary representation which indicates whether a particular term tj occurs in a particular document or not. In this representation weight of term tj define as Wij=1 if Otherwise Wij=0 Document Frequency (DF): Document Frequency dfj is the number of documents in the collection (Di where 1≤i≤n) that term Tj occurs in. Document Frequency refers as global. In DF we consider only term occurs or not ignore whatever value of Wij hold. Inverse Document Frequency (IDF): Inverse Document Frequency idfj calculate as follow idfj=log (N/dfj) N: Total number of document Term frequency–Inverse document frequency (TF-IDF): Term frequency multiply by inverse document frequency is called TF-IDF. (tf-idf)i j= tfi j* idfj III. Performance Measure Confusion Matrix for Spam and Ham class  True positive (TP): Correct classifications, spam documents (positive class) classified as spam (positive class)  True negative (TN): Correct classifications, ham documents (negative class) classified as ham (negative class)  False positive (FP): Incorrect classification, FP occurs when the outcome is incorrectly predicted as spam (or positive) when it is actually ham (negative).  False negative (FN): Incorrect classification, FN occurs when the outcome is incorrectly predicted as ham (or negative) when it is actually spam (positive).  Accuracy (AC): accuracy is ratio of correct classification and total number of predictions Precision: Precision for a class is the ratio of true class (same class in actual belong to same class in prediction) and total number of item belong for that class in prediction. In other word we can say precision is accuracy of our classification for this class. Recall: Recall for a class is the ratio of true class (same class in actual belong to same class in prediction) and total number of item belong for this class in actual. In other word recall is completeness our classification for this class. predicted class ham (-1) spam (+1) Actual Class ham (-1) TN FP Spam (+1) FN TP
  • 4. Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205 www.ijera.com 201 | P a g e False alarm rate: False alarm rate is define as Or FAR=1- Recall for ham documents Ex: TN:-150, FP:-34, FN:-45, TP:-120 Total ham documents =150+34=184 Total spam documents =45+120=165 Ham documents predicted=150+45=195 Spam documents predicted=120+34=154 IV. Experimental Results To determine our filter’s performance when it is trained with the various training sets, we evaluate the filter’s false positive and false negative rates. 4.1 0 20 40 60 80 100 120 Binary representation Term Frequency Inverse Document Frequency TF-IDF Test 1-1 Test 2-1 Test 3-1 predicted class ham (-1) spam (+1) ActualClass ham (-1) 150 34 Spam (+1) 45 120
  • 5. Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205 www.ijera.com 202 | P a g e Spam-precision 4.2 Spam-Recall 4.3 0 20 40 60 80 100 120 Binary representation Term Frequency Inverse Document Frequency TF-IDF Test 1-1 Test 2-1 Test 3-1 0 0.2 0.4 0.6 0.8 1 1.2 Binary representation Term Frequency Inverse Document Frequency TF-IDF Test 1-1 Test 2-1 Test 3-1
  • 6. Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205 www.ijera.com 203 | P a g e FAR(false alarm rate) 4.4 Accuracy V. Results and Discussion Result with Binary representation Train Factor Test 1-1 Test 2-1 Test 3-1 Spam Ham Spam Ham Spam Ham Train1-1 Pre/rec FAR ACC Pre/rec Pre/rec FAR ACC Pre/rec Pre/rec FAR ACC Pre/rec 2 62.35/93.04 0.562 68.43 86.3/43.82 86.05/90.15 0.358 82.63 72.68/64.2 78.48/93.04 0.765 75.65 52.93/23.47 Full 54.32/90.04 0.757 57.16 70.91/24.27 74.29/79.1 0.671 65.72 39.14/32.93 75.02/90.04 0.899 70.05 25.21/10.07 Result with Inverse Document Frequency Trai n Facto r Test 1-1 Test 2-1 Test 3-1 Spam Ham Spam Ham Spam Ham Train1-1 Pre/rec FAR AC C Pre/rec Pre/rec FA R AC C Pre/rec Pre/rec FA R AC C Pre/rec 2 87.7/31.0 7 0.043 6 63.3 6 58.12/95. 64 94.83/26. 97 0.03 6 47.0 9 35.01/96 .4 98.73/31. 07 0.01 2 48 32.33/98. 8 Full 55.09/87. 13 0.71 58.0 4 69.23/28. 96 75.72/77. 63 0.61 66.4 3 41.58/39 74.53/87. 13 0.89 3 68.0 2 21.65/10. 67 Result with Term Frequency Trai n Facto r Test 1-1 Test 2-1 Test 3-1 Spam Ham Spam Ham Spam Ham Tr ai n 1- 1 Pre/rec FA R AC C Pre/rec Pre/rec FA R AC C Pre/rec Pre/rec FA R AC C Pre/rec 0 10 20 30 40 50 60 70 80 90 Binary representation Term Frequency Inverse Document Frequency TF-IDF Test 1-1 Test 2-1 Test 3-1
  • 7. Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205 www.ijera.com 204 | P a g e 2 55.92/96. 91 0.76 4 60.2 6 88.43/23. 6 79.81/96. 35 0.59 7 80.1 81.84/40. 27 75.55/96. 91 0.94 1 74.1 7 39.04/5.9 33 Full 56.57/88. 49 0.67 9 60.2 8 73.58/32. 07 76.41/80. 27 0.60 7 68.3 9 44.82/39. 27 76.98/88. 49 0.79 4 71.5 2 37.36/20. 6 Result with TF-IDF Trai n Facto r Test 1-1 Test 2-1 Test 3-1 Spam Ham Spam Ham Spam Ham Train1-1 Pre/rec FA R AC C Pre/rec Pre/rec FA R AC C Pre/rec Pre/rec FA R AC C Pre/rec 2 52.14/96. 44 0.88 5 53.9 7 76.37/11. 49 76/96.49 0.74 7 75.8 6 74.66/25. 33 74.62/96. 44 0.98 4 72.7 3 13.04/1.6 Full 56.46/88. 91 0.68 6 60.1 7 73.92/31. 42 76.09/80 .6 0.62 1 68.2 3 44.38/37. 93 77.31/88. 91 0.78 3 72.1 2 39.52/21. 73 VI. Conclusion  In Binary representation test data set test 2:1 perform well in terms of recall precision and false alarm rate  IDF representation gives highest false alarm rate and precision in all testing datasets.  Data set test 1:1 give less precision in compare to test 2:1 and test 3:1 data set.  Dimension reduction of training and test data set in to 2D and full 2D perform well as compare to full Dimension. SUMMARY The creation of the Internet has fundamentally changed the way we communicate, conduct business, and interact with the world around us. The World Wide Web, and social networking communities, which provide information consumers with an unprecedented amount of freely available information. However, the openness of these environments has also made them vulnerable to a new class of attacks called Spam attacks. Attackers launch these attacks by deliberately inserting low quality information into information-rich environments to promote that information or to deny access to high quality information. These attacks directly threaten the usefulness and dependability of online information- rich environments, and as a result, an important research question is how to automatically identify and remove this low quality information from these environments. In this research paper, we focus on answering this important question by countering Spam attacks in three of the most important information-rich environments: email systems, the World Wide Web, and social networking communities. For each environment, we perform large-scale data collection and analysis operations to create massive corpora of low and high quality information. Then, we use our collections to identify characteristics that uniquely distinguish examples of low and high quality information. Finally, we use our characterizations to create techniques that automatically detect and remove low quality information from online information-rich environments. References [1] Alexa, “Alexa top 500 sites.” http://www.alexa.com/site/ds/top_sites?ts_mo de=global, 2008. [2] Anderson, D. S. and others, “Spamscatter: Characterizing Internet Scam Hosting Infrastructure,” in Proceedings of 16th Usenix Security Symposium (Security ’07), 2007 [3] Acquisti, A. and Gross, R., “Imagined communities: Awareness, information sharing, and privacy on the facebook,” in Proceedings of the 6th Workshop on Privacy Enhancing Technologies (PET ’06), pp. 36 – 58, 2006. [4] Associated Press, “Official sues students over myspace page.” http: //www.sfgate.com/cgi- bin/article.cgi?file=/news/archive/2006/09/22 % /national/a092749D95.DTL, 2006 [5] Androutsopoulos, I., Paliouras, G., and Michelakis, E., “Learning to Filter Unsolicited Commercial E-mail,” Tech. Rep. 2004/2, National Center for Scientific Research “Demokritos”, 2004. [6] Amitay, E. and others, “The Connectivity Sonar: Detecting Site Functionality by Structural Patterns,” in Proceedings of the
  • 8. Amit Anand Soni et al. Int. Journal of Engineering Research and Application www.ijera.com Vol. 3, Issue 5, Sep-Oct 2013, pp.198-205 www.ijera.com 205 | P a g e 14th ACM Conference on Hypertext and Hypermedia (HYPERTEXT ’03), pp. 38–47, 2003. [7] Ahamad, M. and others, “Guarding the Next Internet Frontier: Countering Denial of Information Attacks,” in Proceedings of the New Security Paradigms Workshop (NSPW ’02), 2002.. [8] Androutsopoulos, I. and others, “An Evaluation of Naive Bayesian Anti-Spam Filtering,” in Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, pp. 9–17, 2000. [9] Androutsopoulos, I. and others, “An Experimental Comparison of Naïve Bayesian and Keyword-based Anti-spam Filtering with Encrypted Personal E-mail Messages,” in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 160–167, 2000. [10] Androutsopoulos, I. and others, “Learning to Filter Spam E-mail: A Comparison of a Naive Bayesian and a Memory-based Approach,” in Proceedings of the Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 1– 13, 2000. [11] Apte, C., Damerau, F., and Weiss, S. M., “Automated Learning of Decision Rules for Text Categorization,” Information Systems, vol. 12, no. 3, pp. 233–251, 1994.