M. K. Kond Reddy et al Int. Journal of Engineering Research and Application
ISSN : 2248-9622, Vol. 3, Issue 5, Sep-Oct 2013, pp.2032-2036

RESEARCH ARTICLE

www.ijera.com

OPEN ACCESS

Data Mining Tool using Clustering Technique on Exploration
Engine Dataset
Mahesh Kumar KondReddy, Sujeeth .T
Dept. of Computer Science & Engineering, Sri Venkateswara University, Tirupati, Andhra Pradesh, India

ABSTRACT
Retrieving good websites from large collections of websites is a major issue. As the number of available Web pages grows, it becomes more difficult for users to find documents relevant to their interests. Clustering is the partitioning of a data set into subsets (clusters) so that the data in each subset share some common trait, often proximity according to some defined distance measure. Clustering improves the quality of the retrieved websites by grouping similar websites together. This paper applies the data mining tool Weka, using k-means clustering, to find clusters in large data sets and to identify the attributes that govern search engine optimization. Unlabeled document collections are becoming increasingly common, and mining such databases is becoming a major challenge.
Keywords: Websites; Data mining; Weka; k-means; Dataset.

I. INTRODUCTION

The Web has experienced continuous growth since its creation. As of March 2002, the largest search engine contained approximately 968 million indexed pages in its database. Finding the right information in such a large collection is extremely difficult [3]. Information extraction plays a vital role in today's life, and extracting relevant documents from the World Wide Web efficiently and effectively is a challenging issue. Because today's search engines essentially perform string matching, the documents retrieved may not be very relevant to the user's query. When websites are clustered, those whose attribute values fall within a particular range are grouped together [6]. Data such as title length, number of keywords in the title, URL length, and number of backlinks is collected from the source code of various websites, and conclusions are derived from these attributes. A popular clustering technique is K-means, in which the data is partitioned into K clusters. In this method, the groups are identified by a set of points called the cluster centers.

II. DOCUMENT CLUSTERING

Document clustering analysis plays an important role in document mining research. A widely adopted definition of optimal clustering is a partitioning that minimizes distances within a cluster and maximizes distances between clusters. In this approach the clusters and, to a limited degree, the relationships between clusters are derived automatically from the documents to be clustered, and the documents are subsequently assigned to those clusters [1]. Users are known to have difficulties in dealing with information retrieval search outputs, especially if the outputs are above a certain size. Clustering can help them find the relevant documents more easily and form an understanding of the different facets of the query results provided for their inspection. This project aimed to investigate a grouping in which the websites ranked in the top 5 fall in one cluster and the other sites in a second cluster. Clustering based on k-means is closely related to a number of other clustering and location problems. These include the Euclidean k-medians problem, in which the objective is to minimize the sum of distances to the nearest center, and the geometric k-center problem, in which the objective is to minimize the maximum distance from every point to its closest center.
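The three objectives mentioned above differ only in how the point-to-center distances are combined. The following minimal Python sketch evaluates each of them for a fixed set of centers; the function names and the sample points are illustrative assumptions and are not taken from the paper.

```python
import math

def nearest_distance(point, centers):
    """Euclidean distance from a point to its closest center."""
    return min(math.dist(point, c) for c in centers)

def k_means_cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(nearest_distance(p, centers) ** 2 for p in points)

def k_medians_cost(points, centers):
    """Euclidean k-medians objective: sum of distances to the nearest center."""
    return sum(nearest_distance(p, centers) for p in points)

def k_center_cost(points, centers):
    """k-center objective: largest distance from any point to its nearest center."""
    return max(nearest_distance(p, centers) for p in points)

# Made-up 2-D points and centers, for illustration only.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
centers = [(0.0, 1.0), (10.0, 2.0)]

print(k_means_cost(points, centers))    # squared distances 1 + 1 + 4 + 4
print(k_medians_cost(points, centers))  # distances 1 + 1 + 2 + 2
print(k_center_cost(points, centers))   # largest single distance
```

A clustering that is good under one objective is not necessarily the best under another, which is why the three problems are treated as related but distinct.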
A. K-means Algorithm
K-means clustering is a very popular algorithm for finding clusters in a dataset by iterative computation. It has the advantages of being simple to implement and of finding at least a locally optimal clustering. The algorithm [2],[9] is composed of the following steps:


1. Initialize k cluster centers to be seed points. (These centers can be produced randomly or generated in other ways.)
2. For each sample, find the nearest cluster center, put the sample in that cluster, and recompute the center of the altered cluster (repeat n times).
3. Examine all samples again and put each one in the cluster identified with the nearest center (do not recompute any cluster centers). If the members of each cluster have not changed, stop. If they have changed, go to step 2.
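The steps above can be sketched in Python as follows. This is one possible reading of the description, not the authors' or Weka's implementation: the seed points are assumed to be k randomly chosen samples, and the names `k_means`, `nearest`, `mean`, and `max_rounds` are hypothetical.

```python
import math
import random

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def k_means(samples, k, seed=0, max_rounds=100):
    rng = random.Random(seed)
    # Step 1: assumption: use k distinct samples as the seed points.
    centers = rng.sample(samples, k)
    labels = [None] * len(samples)

    def recompute(j):
        members = [x for x, lab in zip(samples, labels) if lab == j]
        if members:  # an emptied cluster keeps its previous center
            centers[j] = mean(members)

    for _ in range(max_rounds):
        # Step 2: place each sample in its nearest cluster and
        # recompute the center of every cluster this changes.
        for i, x in enumerate(samples):
            old, new = labels[i], nearest(x, centers)
            labels[i] = new
            recompute(new)
            if old is not None and old != new:
                recompute(old)
        # Step 3: reassign all samples without moving the centers.
        final = [nearest(x, centers) for x in samples]
        if final == labels:
            break  # memberships unchanged: stop
        labels = final  # memberships changed: go back to step 2
    return centers, labels

# Made-up, well-separated 2-D samples for illustration.
samples = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
           (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, labels = k_means(samples, k=2)
print(labels)
```

Recomputing a center from scratch after every assignment keeps the sketch short; it is quadratic in the number of samples and is meant only to mirror the three steps.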

III. THE METHOD FOR FINDING MIXTURE STRUCTURE IN HETEROGENEOUS MULTIVARIATE DATA SET

A. Partly And Completely Overlapped Group Structures In Heterogeneous Multivariate Data Set
Partly and completely overlapped group structures make it difficult to determine the number and structure of clusters in a multivariate data set. The cases of partly and completely overlapped group structures in heterogeneous data are shown in Figure 01, which also shows the mixture structure in the data. The first case is partial group overlap: the group means are different but very close to each other, and the group variances are the same, as shown in Figure 01(a). The second case is complete group overlap: the group means are the same, and the group variances differ from each other, as shown in Figure 01(b). In Figure 01, the horizontal axis denotes the values of the independent variable, and the vertical axis denotes the corresponding values of the dependent variable, which has a heterogeneous and therefore mixture structure. Figure 01 contains only one independent variable, which has two subgroups.

B. Algorithm For Diagnosing Mixture Structure In Multivariate Heterogeneous Data Set And Refining Groups
Model selection methods based on information criteria are applied to diagnose the mixture structure in a heterogeneous multivariate data set. The algorithm given in this section is a new algorithm. The steps of the algorithm proposed for refining groups in data using dynamic model-based clustering are given below:
1. Apply model-based clustering to the multivariate data set, assuming that the data have a mixture structure, that is, that the data come from a mixture of multivariate normal densities. This step shows whether the multivariate data are homogeneous (no mixture structure) or have a group structure (a mixture structure).
2. If the multivariate data have a mixture structure, that is, contain groups, then find these groups. Test each group found for homogeneity; in other words, check each group for further mixture structure or subgroups.
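The mixture assumption in step 1 can be written out explicitly. The text does not say which information criterion is used, so the Bayesian information criterion (BIC) below is only an assumed, commonly used example; the symbols $G$, $\pi_g$, $\boldsymbol{\mu}_g$, $\boldsymbol{\Sigma}_g$, $m_G$, and $n$ are introduced here for illustration.

```latex
% Mixture of G multivariate normal densities \phi with mixing proportions \pi_g
f(\mathbf{x}) = \sum_{g=1}^{G} \pi_g \, \phi(\mathbf{x} \mid \boldsymbol{\mu}_g, \boldsymbol{\Sigma}_g),
\qquad \pi_g \ge 0, \quad \sum_{g=1}^{G} \pi_g = 1

% One possible information criterion (assumed): BIC for the model with G components,
% where \hat{L}_G is the maximized likelihood, m_G the number of free parameters,
% and n the number of observations
\mathrm{BIC}(G) = -2 \ln \hat{L}_G + m_G \ln n
```

With this sign convention the model with the smallest criterion value is preferred. If $G = 1$ is selected, the data are treated as homogeneous; if $G > 1$ is selected, the data are treated as having a mixture structure, and step 2 repeats the same check within each group found.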

IV. ORGANIZATION OF DATA

It is important for a search engine to maintain high-quality websites in its results, as this improves optimization. We built a database with the following attributes: length of title, keywords in title, domain length, number of backlinks, and top-rank website.
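As an illustration of how such a database could be laid out for Weka, the sketch below writes the numeric attributes to an ARFF file body. The attribute identifiers, the relation name, and the two data rows are invented placeholders and are not the paper's data; the top-rank attribute is left out because the text does not say how it is encoded.

```python
# Hypothetical identifiers for the numeric attributes listed above.
ATTRIBUTES = ["title_length", "keywords_in_title", "domain_length", "backlinks"]

def to_arff(relation, attributes, rows):
    """Return ARFF text with one numeric attribute per column."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in attributes]
    lines += ["", "@data"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"

# Placeholder rows, invented purely to show the layout.
rows = [(40, 3, 25, 1000), (60, 5, 20, 500)]
arff_text = to_arff("websites", ATTRIBUTES, rows)
print(arff_text)
```

The same rows could equally be saved as a CSV file with a header line, since both formats are mentioned as accepted inputs.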
A. Working with Weka on Dataset
Open Weka and click the Explorer option on the right side, then open the data file, which is in CSV or ARFF format, under the Preprocess option [4],[5]. Choosing the Explorer option brings up the screen shown in Fig. 1, which clearly indicates the open file option. We click the open file option and choose the data set. Although Weka provides filters to accomplish preprocessing tasks, they are not necessary for clustering in Weka. This is because the Weka SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes, and it automatically normalizes numerical attributes when doing distance computations [7].
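To see why normalization matters for attributes on very different scales, such as title length and number of backlinks, the following sketch rescales each attribute to the range [0, 1] before computing a Euclidean distance. Min-max scaling is an assumption made for this illustration, and the function names and rows are invented; Weka's internal computation may differ in detail.

```python
import math

def attribute_ranges(rows):
    """Per-attribute (min, max) over all rows."""
    return [(min(col), max(col)) for col in zip(*rows)]

def normalized_distance(a, b, ranges):
    """Euclidean distance after rescaling each attribute to [0, 1]."""
    total = 0.0
    for x, y, (lo, hi) in zip(a, b, ranges):
        span = hi - lo
        if span == 0:
            continue  # a constant attribute contributes nothing
        total += ((x - lo) / span - (y - lo) / span) ** 2
    return math.sqrt(total)

# Invented rows: (title length, number of backlinks).
rows = [(30, 100), (60, 20000), (45, 10050)]
ranges = attribute_ranges(rows)

raw = math.dist(rows[0], rows[2])                        # dominated by backlinks
scaled = normalized_distance(rows[0], rows[2], ranges)   # both attributes count equally
print(raw, scaled)
```

Without rescaling, the backlink counts would dominate the distance and the title length would have almost no influence on the clustering.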

Figure 01. (a) Partial overlap: group means are different but very close to each other; group variances are the same. (b) Complete overlap: group means are the same; group variances are different from each other.

This lists all attributes present in the dataset. We can select any individual attribute we want to include, or select all of them.


Fig. 1: Opening page

After this, click the Cluster tab, click the Choose button on the left side, and select the clustering algorithm to apply. We select SimpleKMeans; the resulting screen is shown in Fig. 2 [8].

Fig. 2: Select algorithm

Next, click on the text box to the right of the "Choose" button to get the pop-up window shown in Fig. 3 for editing the clustering parameters. In the pop-up window we enter 2 as the number of clusters and leave the value of "seed" as is. The seed value is used in generating a random number which is, in turn, used for making the initial assignment of instances to clusters.

Fig. 3: Choose parameters

Note that, in general, K-means is quite sensitive to how clusters are initially assigned. Thus, it is often necessary to try different seed values and evaluate the results [10].
Once the options have been specified, we can run the clustering algorithm. Here we make sure that in the "Cluster Mode" panel the "Use training set" option is selected, and we click "Start". We can right-click the result set in the "Result list" panel and view the results of clustering in a separate window.
The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to different clusters. Cluster centroids are the mean vectors for each cluster (so each dimension value in the centroid represents the mean value for that dimension in the cluster). Thus, centroids can be used to characterize the clusters.
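The sensitivity to the seed and the initial assignment noted above can be explored outside the GUI by running the clustering once per seed and comparing the within-cluster sum of squared errors. The sketch below uses a plain batch k-means written for this illustration; it assumes the seed selects the starting centers among the samples, it is not Weka's SimpleKMeans, and all names and points are hypothetical.

```python
import math
import random

def lloyd(points, k, seed, rounds=50):
    """Batch k-means; returns (centers, within-cluster squared error)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # the seed decides the starting centers
    for _ in range(rounds):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[j].append(p)
        # An empty group keeps its previous center.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if updated == centers:
            break
        centers = updated
    sse = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
    return centers, sse

def best_of_seeds(points, k, seeds):
    """Run once per seed; return (centers, sse, seed) of the lowest-error run."""
    return min((lloyd(points, k, s) + (s,) for s in seeds), key=lambda r: r[1])

# Made-up points, for illustration only.
points = [(0, 0), (0, 1), (5, 0), (5, 1), (10, 0), (10, 1)]
centers, sse, seed = best_of_seeds(points, k=2, seeds=range(5))
print(seed, sse)
```

Keeping the run with the lowest error is one simple way to evaluate the results of different seeds; other criteria could be used instead.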

The result shows that cluster 0 contains 13 websites, with a title length of 59 characters, 5 keywords in the title, a URL length of 22 characters, and 6638 backlinks, while cluster 1 contains 16 websites, with a title length of 36 characters, 3 keywords in the title, a URL length of 29 characters, and 19163 backlinks, as shown in Fig. 4.
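The per-cluster statistics reported in the result window (number of instances, percentage, and centroid) can be reproduced from cluster labels with a few lines of Python. The sketch below is illustrative only: the function name `summarize` is hypothetical, and the rows and labels are invented rather than taken from the paper's dataset.

```python
from collections import defaultdict

def summarize(rows, labels):
    """Per-cluster size, share of all instances (%), and centroid (mean vector)."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    summary = {}
    for label, members in sorted(groups.items()):
        centroid = tuple(sum(col) / len(members) for col in zip(*members))
        summary[label] = {
            "count": len(members),
            "percent": 100.0 * len(members) / len(rows),
            "centroid": centroid,
        }
    return summary

# Invented rows: (title length, keywords in title); labels are cluster ids.
rows = [(50, 4), (60, 6), (30, 2), (40, 4)]
labels = [0, 0, 1, 1]
summary = summarize(rows, labels)
for label, stats in summary.items():
    print(label, stats)
```

Each centroid entry is the mean of one attribute over the cluster's members, which is why the centroid values can be read as a profile of a typical website in that cluster.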
Another way of understanding the characteristics of each cluster is through visualization. We can do this by right-clicking the result set in the "Result list" panel on the left and selecting "Visualize cluster assignments". This pops up the visualization window shown in Fig. 5.
In this window, we can choose the cluster number and any of the other attributes for each of the three available dimensions (x-axis, y-axis, and color). Different combinations of choices result in a visual rendering of different relationships within each cluster.


Fig. 4: Result of clustering

Fig.5: Visual result


In the above example, we have chosen the cluster number as the x-axis, the instance number (assigned by Weka) as the y-axis, and the "length of title" attribute as the color dimension. This results in a visualization of the distribution of title length in the two clusters.
As more data is collected from websites, more detail can be obtained and further attributes identified. By this method we find that backlinks > 19000, length of title < 40, keywords in title > 3, and domain length < 30 are good for search engine optimization.

REFERENCES
[1] Hongwei Yang, "A Document Clustering Algorithm for Web Search Engine Retrieval System", School of Software, Yunnan University, Kunming 650021, China.
[2] S. Kantabutra, "Efficient Representation of Cluster Structure in Large Data Sets", Ph.D. Thesis, Tufts University, Medford, MA, September 2001.
[3] Wang Jun, OuYang Zheng-Zheng, "The Research of K-Means Clustering Algorithm Based on Association Rules".
[4] http://maya.cs.depaul.edu/classes/ect584/weka/k-means.html
[5] http://www.cs.ccsu.edu/~markov/wekatutorial.pdf
[6] http://thesai.org/Downloads/Volume3No4/Paper_20 Knowledge_Discovery_in_Health_Care_Datasets Using_Data_Mining_Tools.pdf
[7] www.gtbit.org/downloads/dwdmsem6/dwdmsem6lman.pdf
[8] http://www.iasri.res.in/ebook/win_school_aa/notes/WEKA.pdf
[9] R. Kannan, S. Vempala, and Adrian Vetta, "On Clusterings: Good, Bad, and Spectral", Proc. of the 41st Foundations of Computer Science, Redondo Beach, 2000.
[10] http://www.bvicam.ac.in/news/INDIACom%202010%20Proceedingspapers/Group3/INDIACom10_388_Paper.pdf

 
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering AlgorithmIRJET- Review of Existing Methods in K-Means Clustering Algorithm
IRJET- Review of Existing Methods in K-Means Clustering Algorithm
 
A Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data MiningA Comparative Study Of Various Clustering Algorithms In Data Mining
A Comparative Study Of Various Clustering Algorithms In Data Mining
 
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
Principle Component Analysis Based on Optimal Centroid Selection Model for Su...
 
Enhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online DataEnhanced Clustering Algorithm for Processing Online Data
Enhanced Clustering Algorithm for Processing Online Data
 

M. K. Kond Reddy et al Int. Journal of Engineering Research and Application
ISSN : 2248-9622, Vol. 3, Issue 5, Sep-Oct 2013, pp.2032-2036
www.ijera.com

RESEARCH ARTICLE    OPEN ACCESS

Data Mining Tool using Clustering Technique on Exploration Engine Dataset
Mahesh Kumar KondReddy, Sujeeth .T
Dept. of Computer Science & Engineering, Sri Venkateswara University, Tirupati, Andhra Pradesh, India

ABSTRACT
Retrieving good websites from large collections of websites is a major issue. As the number of available Web pages grows, it becomes more difficult for users to find documents relevant to their interests. Clustering is the classification of a data set into subsets (clusters), so that the data in each subset share some common trait, often proximity according to some defined distance measure. By clustering we improve the quality of websites by grouping similar websites together. This paper addresses the application of the data mining tool Weka, applying k-means clustering to find clusters in huge data sets and to find the attributes that govern the optimization of search engines. Unlabeled document collections are becoming increasingly common, and mining such databases is becoming a major challenge.
Keywords: Websites; Data mining; Weka; k-means; Dataset.

I. INTRODUCTION
The Web has experienced continuous growth since its creation. As of March 2002, the largest search engine contained approximately 968 million indexed pages in its database. Finding the right information in such a large collection is extremely difficult [3]. Information extraction plays a vital role in today's life. How efficiently and effectively the relevant documents are extracted from the World Wide Web is a challenging issue. Because today's search engines do only string matching, the documents retrieved may not be very relevant to the user's query. By clustering the websites, websites whose attribute values fall in a particular range are grouped together [6].
Data is collected from the source code of various websites, such as title length, number of keywords in the title, URL length, and number of backlinks, and the conclusions are derived from these attributes. A popular technique for clustering is based on K-means, in which the data is partitioned into K clusters. In this method, the groups are identified by a set of points called the cluster centers.

II. DOCUMENT CLUSTERING
Document clustering analysis plays an important role in document mining research. A widely adopted definition of optimal clustering is a partitioning that minimizes distances within a cluster and maximizes distances between clusters. In this approach the clusters and, to a limited degree, the relationships between clusters are derived automatically from the documents to be clustered, and the documents are subsequently assigned to those clusters [1]. Users are known to have difficulties in dealing with information retrieval search outputs, especially if the outputs are above a certain size. Clustering can enable them to find the relevant documents more easily and also help them to form an understanding of the different facets of the query that have been provided for their inspection. This project aimed to investigate a grouping with the websites ranked in the top 5 in one cluster and the other sites in a second cluster. Clustering based on k-means is closely related to a number of other clustering and location problems. These include the Euclidean k-medians problem, in which the objective is to minimize the sum of distances to the nearest center, and the geometric k-center problem, in which the objective is to minimize the maximum distance from every point to its closest center.

A. K means Algorithm
K-Means clustering is a very popular algorithm for finding the clustering in a dataset by iterative computation. It has the advantages of being simple to implement and of finding at least a locally optimal clustering.
The algorithm [2],[9] is composed of the following steps:
1. Initialize k cluster centers to be seed points. (These centers can be produced randomly or generated in other ways.)
2. For each sample, find the nearest cluster center, put the sample in that cluster, and recompute the center of the altered cluster (repeat n times).
3. Examine all samples again and put each one in the cluster identified with the nearest center (without recomputing any cluster centers). If the membership of no cluster has changed, stop. Otherwise, go to step 2.

III. THE METHOD FOR FINDING MIXTURE STRUCTURE IN HETEROGENEOUS MULTIVARIATE DATA SET
A. Partly And Completely Overlapped Group Structures In Heterogeneous Multivariate Data Set
Partly and completely overlapped group structures make it difficult to determine the number and structure of clusters in a multivariate data set. The cases of partly and completely overlapped group structures in heterogeneous data are shown in Figure 01, which also shows the mixture structure in the data. In the first case, partial group overlap, the group means are different but very close to each other and the group variances are the same, as shown in Figure 01(a). In the second case, complete group overlap, the group means are the same and the group variances differ from each other, as shown in Figure 01(b). In Figure 01, the horizontal axis denotes the values of the independent variable and the vertical axis denotes the values of the dependent variable. There is only one independent variable in Figure 01; it has a heterogeneous structure, and thus a mixture structure, with two subgroups.

Figure 01. (a) The case of partly overlapping: group means are different but very close to each other. Group variances are the same. (b) The case of completely overlapping: group means are the same. Group variances are different from each other.

B. Algorithm For Diagnosing Mixture Structure In Multivariate Heterogeneous Data Set And Refining Groups
Model selection methods based on information criteria are applied for diagnosing the mixture structure in a heterogeneous multivariate data set. The algorithm given in this section is a new algorithm. The steps of the proposed algorithm for refining groups in data using dynamic model-based clustering are given below:
1. Apply model-based clustering to the multivariate data set, assuming that the data have a mixture structure, that is, that the data come from a mixture of multivariate normal densities. This step shows whether the data are homogeneous (no mixture structure) or have a group structure (a mixture structure).
2. If the data have a mixture structure, that is, if they contain groups, find these groups. Test each group found for homogeneity; in other words, check each group for further mixture structure or subgroups.

IV. ORGANIZATION OF DATA
It is important for a search engine to maintain high-quality websites, as this improves optimization. We built a database with the following attributes: length of title, keywords in title, domain length, number of backlinks, and top-rank website.

A. Working with Weka on Dataset
Open Weka and click the Explorer option on the right side, then open the data file, which is in csv or arff format, under the Preprocess option [4],[5]. Choosing the Explorer option brings up the screen shown in Fig 1, which clearly indicates the open file option. We click on open file and choose the data set. This lists all attributes present in the dataset; we can select any of them to include, or select all. Although Weka provides filters for preprocessing tasks, they are not necessary for clustering in Weka, because the SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes and automatically normalizes numerical attributes when doing distance computations [7].
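The effect of normalizing numerical attributes before computing distances can be illustrated with a small sketch. It assumes min-max scaling of each attribute to [0, 1] followed by a Euclidean distance; it is an illustration of the idea, not Weka's code. The function names are hypothetical, and the sample rows are illustrative values in the order title length, keywords in title, URL length, backlinks.

```python
import math

def min_max_ranges(rows):
    """Return a (min, max) pair per column for equal-length numeric rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]

def normalize(row, ranges):
    """Scale each value to [0, 1]; a constant column maps to 0.0."""
    scaled = []
    for value, (lo, hi) in zip(row, ranges):
        scaled.append(0.0 if hi == lo else (value - lo) / (hi - lo))
    return scaled

def normalized_distance(a, b, ranges):
    """Euclidean distance between two rows after min-max scaling."""
    na, nb = normalize(a, ranges), normalize(b, ranges)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(na, nb)))

# Illustrative rows (assumed values, not the paper's dataset):
# title length, keywords in title, URL length, backlinks
sites = [
    [59, 5, 22, 6638],
    [36, 3, 29, 19163],
    [48, 4, 25, 12000],
]
ranges = min_max_ranges(sites)
d = normalized_distance(sites[0], sites[1], ranges)
print(d)
```

Without scaling, the backlink counts (in the thousands) would dominate the distance, and the title and URL attributes would have almost no influence on which cluster a website joins.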
Fig.1: Opening page

After this, click on the Cluster tab, click the "Choose" button on the left side, and select the clustering algorithm to apply. We select SimpleKMeans; the resulting screen is shown in Fig 2 [8].

Fig.2: Select algorithm

Next, click on the text box to the right of the "Choose" button to get the pop-up window shown in Fig 3 for editing the clustering parameters. In the pop-up window we enter 2 as the number of clusters and leave the value of "seed" as is. The seed value is used in generating a random number which is, in turn, used for making the initial assignment of instances to clusters.

Fig 3: Choose parameters

Note that, in general, K-means is quite sensitive to how clusters are initially assigned. Thus, it is often necessary to try different seed values and evaluate the results [10]. Once the options have been specified, we can run the clustering algorithm. We make sure that in the "Cluster Mode" panel the "Use training set" option is selected, and we click "Start". We can right-click the result set in the "Result list" panel and view the results of clustering in a separate window. The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to the different clusters. Cluster centroids are the mean vectors for each cluster (so each dimension value in the centroid represents the mean value for that dimension in the cluster). Thus, centroids can be used to characterize the clusters.
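The roles of the cluster count and the seed can be made concrete with a minimal k-means sketch. It uses the common batch variant (reassign every sample, then recompute every center as a mean vector), which differs slightly from the incremental update in the steps listed in Section II.A, and it is not Weka's implementation. The function names, the default seed, and the sample points are assumptions chosen for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def k_means(points, k, seed=10, max_iterations=100):
    """Return (centroids, assignments) for a list of numeric vectors."""
    rng = random.Random(seed)  # the seed fixes the initial centers
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [None] * len(points)
    for _ in range(max_iterations):
        # Assign every sample to its nearest center.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no membership changed, so the clustering is stable
        assignments = new_assignments
        # Recompute each center as the mean vector of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return centroids, assignments

# Illustrative two-dimensional points forming two well-separated groups.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
          [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
centroids, labels = k_means(points, k=2, seed=10)
print(centroids, labels)
```

On well-separated data such as this, different seeds should lead to the same grouping, possibly with the cluster numbers swapped. On overlapping data the result can change with the seed, which is why several seed values are usually tried.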
The result shows that cluster 0 contains 13 websites, with a title length of 59 characters, 5 keywords in the title, a URL length of 22 characters, and 6638 backlinks, while cluster 1 contains 16 websites, with a title length of 36 characters, 3 keywords in the title, a URL length of 29 characters, and 19163 backlinks, as shown in Fig 4. Another way of understanding the characteristics of each cluster is through visualization. We can do this by right-clicking the result set in the "Result list" panel on the left and selecting "Visualize cluster assignments". This pops up the visualization window shown in Fig 5. In this window, we choose the cluster number and any of the other attributes for each of the three available dimensions (x-axis, y-axis, and color). Different combinations of choices result in a visual rendering of different relationships within each cluster.
Fig. 4: Result of clustering

Fig.5: Visual result
In the above example, we have chosen the cluster number as the x-axis, the instance number (assigned by Weka) as the y-axis, and the "length of title" attribute as the color dimension. This results in a visualization of the distribution of the length of title in the two clusters. As more data is collected from websites, more detail can be obtained and further attributes identified. With this method we find that backlinks > 19000, length of title < 40, keywords in title > 3 and domain length < 30 are good for search engine optimization.

REFERENCES
[1] Hongwei Yang, "A Document Clustering Algorithm for Web Search Engine Retrieval System", School of Software, Yunnan University, Kunming 650021, China.
[2] S. Kantabutra, Efficient Representation of Cluster Structure in Large Data Sets, Ph.D. Thesis, Tufts University, Medford MA, September 2001.
[3] Wang Jun, OuYang Zheng-Zheng, "The Research of K-Means Clustering Algorithm Based on Association Rules".
[4] http://maya.cs.depaul.edu/classes/ect584/weka/k-means.html
[5] http://www.cs.ccsu.edu/~markov/wekatutorial.pdf
[6] http://thesai.org/Downloads/Volume3No4/Paper_20 Knowledge_Discovery_in_Health_Care_Datasets Using_Data_Mining_Tools.pdf
[7] www.gtbit.org/downloads/dwdmsem6/dwdmsem6lman.pdf
[8] http://www.iasri.res.in/ebook/win_school_aa/notes/WEKA.pdf
[9] R. Kannan, S. Vempala, and Adrian Vetta, "On Clusterings: Good, Bad, and Spectral", Proc. of the 41st Foundations of Computer Science, Redondo Beach, 2000.
[10] http://www.bvicam.ac.in/news/INDIACom%202010%20Proceedingspapers/Group3/INDIACom10_388_Paper.pdf