USING FEEDBACK AND K-MEANS CLUSTERING TO
                REFINE WEB DATA


Abstract

Nowadays, new web sites are created all the time, and users often cannot find the accurate data they need by searching the web. Web mining is commonly performed with page ranking algorithms, among many other approaches. In this paper, users refine web pages by giving feedback or a rating, either manually or automatically. K-means is a basic and widely used clustering algorithm. We propose a genetic algorithm to improve cluster quality and obtain more accurate clusters, and we also apply web logs to refine the results further. Web mining with feedback eliminates unwanted sites from the web and also helps improve the data offered to users when sites are developed.

KEY WORDS: Web mining, clustering, k-means, web logs.

1. Introduction

With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. This covers the automatic search of information resources available online, i.e., Web content mining, and the discovery of user access patterns from Web servers, i.e., Web usage mining.

There are roughly three knowledge discovery domains that pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining, resource
discovery based on concept indexing, or agent-based technology may also fall in this category. Web structure mining is the process of inferring knowledge from the organization of the World Wide Web and the links between references and referents in the Web. Finally, web usage mining, also known as Web Log Mining, is the process of extracting interesting patterns from web access logs.

We can broadly categorize Web data clustering into (i) users' sessions-based and (ii) link-based. The former uses the Web log data and tries to group together a set of users' navigation sessions having similar characteristics. In this framework, Web-log data provide information about activities performed by a user from the moment the user enters a Web site to the moment the same user leaves it. The records of users' actions within a Web site are stored in a log file. Each record in the log file contains the client's IP address, the date and time the request is received, the requested object and some additional information, such as the protocol of the request, the size of the object, etc. Figure 1 presents a sample of a Web access log file from a Web server.

Figure 1: A sample of Web Server Log File

    141.243.1.172 [29:23:53:25] "GET /Software.html HTTP/1.0" 200 1497
    query2.lycos.cs.cmu.edu [29:23:53:36] "GET /Consumer.html HTTP/1.0" 200 1325
    tanuki.twics.com [29:23:53:53] "GET /News.html HTTP/1.0" 200 1014
    wpbfl2-45.gate.net [29:23:54:15] "GET / HTTP/1.0" 200 4889
    wpbfl2-45.gate.net [29:23:54:16] "GET /icons/circle_logo_small.gif HTTP/1.0" 200 2624
    wpbfl2-45.gate.net [29:23:54:18] "GET

The standard K-Means algorithm was used to cluster users' traversal paths. However, it is not clear how the similarity measure was devised and whether the clusters are meaningful. Associations and
sequential patterns between web transactions are discovered based on the Apriori algorithm. Good surveys of clustering algorithms are available in the literature. The k-means algorithm is one of the most widely used clustering algorithms. The algorithm partitions the data points (objects) into k groups (clusters), so as to minimize the sum of the (squared) distances between the data points and the center (mean) of the clusters. To apply the k-means algorithm:

        • Choose k data points to initialize the clusters.
        • For each data point, find the nearest cluster center and assign that data point to the corresponding cluster.
        • Update the center of each cluster using the mean of the data points assigned to that cluster.
        • Repeat steps 2 and 3 until there are no more changes in the values of the means.

In spite of its simplicity, the k-means algorithm involves a very large number of nearest neighbor queries. The high time complexity of the k-means algorithm makes it impractical for data sets with a large number of points. Reducing the large number of nearest neighbor queries in the algorithm can accelerate it. In addition, the number of distance calculations increases exponentially with the dimensionality of the data.

Many algorithms have been proposed to accelerate k-means. The use of kd-trees has been suggested. However, backtracking is required, which increases the computational complexity. Kd-trees are not efficient for higher dimensions. Furthermore, it is not guaranteed that an exact match of the nearest neighbor can be found unless some extra search is done. Elkan suggests the use of the triangle inequality to accelerate k-means. The use of R-Trees has also been suggested; nevertheless, R-Trees may not be appropriate for higher-dimensional problems. The Partial Distance (PD) algorithm has also been proposed. It allows early termination of the distance calculation by introducing a premature exit condition in the search process.

As seen in the literature, researchers have contributed only to accelerating the algorithm; there is no contribution to cluster refinement. In this study, we propose a new algorithm to improve k-means clustering in web usage data mining.
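The k-means steps listed above can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the implementation used in this paper: the function name `kmeans`, the use of Euclidean distance, the random choice of initial points, and the `max_iter` and `seed` parameters are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Step 1: choose k data points as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Stop when no point changes cluster, so the means no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment
```

Every pass computes the distance from each point to each center, which is the large number of nearest neighbor queries that the acceleration techniques discussed above try to reduce.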
The proposed algorithm consists of two steps. In the first step, to avoid local minima, we present a simple and efficient method to select initial centroids based on the mode values of the data vector, and the k-means algorithm is then applied to cluster the data vectors. In the second step, a Genetic Algorithm (GA) is applied to refine the clusters and improve the quality of the clusters of users' sessions.

The paper is organized as follows. Section 2 defines the web access logs. Section 3 presents the standard k-means algorithm. Section 4 proposes the cluster refinement algorithm with a Genetic Algorithm (GA) to improve the users' session clusters. Section 5 presents the experiments and results, and Section 6 concludes the work.

2. Web Access Logs:

Basic access logs

In general, web server logs consist of these fields: (i) User's IP address, (ii) Access time, (iii) Request method ("GET", "POST", etc.), (iv) URL of the page accessed, (v) Protocol (typically HTTP/1.0), (vi) Number of bytes.

This field can be filled in automatically by system programming algorithms.

Modified access logs

The modified web server logs consist of these fields: (i) User's IP address, (ii) Access time, (iii) Request method ("GET", "POST", etc.), (iv) URL of the page accessed, (v) Protocol (typically HTTP/1.0), (vi) Number of bytes, (vii) Rating or feedback.

The last field holds the rating of the site, indicating whether or not the site is useful for the user's requirements. This helps in refining the web data.

Rating sites typically show a series of images (or other content) in random fashion, or chosen by computer algorithm, rather than allowing users to choose. They then ask users for a rating or assessment, which is generally done quickly and without great deliberation. Users score items on a scale of 1 to 10, or yes or no. Others, such as BabeVsBabe.com, ask users to choose between two pictures. Typically, the site gives instant feedback in terms of the item's running score, or the percentage of other users who agree with the assessment. They sometimes offer
aggregate statistics or "best" and "worst" lists. Most allow users to submit their own image, sample, or other relevant content for others to rate. Some require the submission as a condition of membership.

3. Standard K-Means Algorithm

One of the most popular clustering techniques is the k-means clustering algorithm. Starting from a random partitioning, the algorithm repeatedly (i) computes the current cluster centers (i.e., the average vector of each cluster in data space) and (ii) reassigns each data item to the cluster whose center is closest to it. It terminates when no more reassignments take place. By this means, the intra-cluster variance, that is, the sum of squares of the differences between data items and their associated cluster centers, is locally minimized. The strength of k-means is its runtime, which is linear in the number of data elements, and its ease of implementation. However, the algorithm tends to get stuck in suboptimal solutions (dependent on the initial partitioning and the data ordering) and it works well only for spherically shaped clusters. It requires the number of clusters to be provided or to be determined (semi-)automatically. In our experiments, we run k-means using the correct cluster number.

1. Choose a number of clusters K.
2. Initialize cluster centers n1,… nk.
        a. Could pick k data points and set cluster centers to these points
        b. Or could randomly assign points to clusters and take means of clusters
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute cluster centers (mean of data points in cluster).
5. Repeat steps 3 and 4, and stop when there are no new re-assignments.

4. Genetic Algorithm

The initial cluster centers are normally chosen either sequentially or randomly, as given in the standard algorithm. The quality of the final clusters depends on these initial seeds, and a poor choice may lead to a local minimum; this is one of the disadvantages of k-means clustering. To avoid this, in our method we select the modes of the data vector as initial cluster centers. Based on the number of clusters, the modes are selected one
after another. Initially, the first mode value is selected as the center of the first cluster, and the next most frequently occurring value (the next mode value) is assigned as the center of the next cluster.

A genetic algorithm (GA) is a randomized search and optimization technique guided by the principles of evolution and natural genetics, having a large amount of implicit parallelism. GAs perform search in complex, large and multimodal landscapes, and provide near-optimal solutions for the objective or fitness function of an optimization problem.

In this algorithm, the search space is encoded in the form of strings (called chromosomes). The basic reason for our refinement is that, in any clustering algorithm, the obtained clusters will never give 100% quality. There will be some errors, known as misclustering; that is, a data item can be wrongly clustered. These kinds of errors can be avoided by using our refinement algorithm. GAs have applications in fields as diverse as VLSI design, image processing, neural networks, machine learning, job shop scheduling, etc.

The clusters obtained from the improved k-means clustering are considered as input to our refinement algorithm. Initially, a random point is selected from each cluster; with these, a chromosome is built. In this way an initial population of 10 chromosomes is built. For each chromosome, the entropy is calculated as the fitness value, and the global minimum is extracted. With this initial population, the genetic operators such as reproduction, crossover and mutation are applied to produce a new population. While applying the crossover operator, the cluster points get shuffled, which means that a point can move from one cluster to another. From this new population, the local minimum fitness value is calculated and compared with the global minimum. If the local minimum is less than the global minimum, then the global minimum is assigned the local minimum, and the next iteration continues with the new population. Otherwise, the next iteration continues with the same old population. This process is repeated for N iterations.

The following section shows that our refinement algorithm improves the cluster quality. The algorithm is given as:

1. Choose a number of clusters k
2. Initialize cluster centers n1,… nk based on mode
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute cluster centers (mean of data points in cluster)
5. Stop when there are no new re-assignments.
6. GA based refinement
      a. Construct the initial population (p1)
      b. Calculate the global minimum (Gmin)
      c. For i = 1 to N do
             i. Perform reproduction
             ii. Apply the crossover operator between each parent.
             iii. Perform mutation and get the new population (p2).
             iv. Calculate the local minimum (Lmin).
             v. If Lmin < Gmin then
                     a. Gmin = Lmin;
                     b. p1 = p2;
      d. Repeat

5. Experiments

We have generated clusters using both algorithms for several different logs obtained from the Internet Traffic Archive (http://ita.ee.lbl.gov/). The following six web access log data sets, collected from various web servers, were used to test our proposed method:

• EPA-HTTP - a day of HTTP logs from a busy WWW server.
• SDSC-HTTP - a day of HTTP logs from a busy WWW server.
• Calgary-HTTP - a year of HTTP logs from a CS departmental WWW server.
• ClarkNet-HTTP - two weeks of HTTP logs from a busy Internet service provider WWW server.
• NASA-HTTP - two months of HTTP logs from a busy WWW server.
• Saskatchewan-HTTP - seven months of HTTP logs from a University WWW server.

The following table gives a brief description of each web access log set.

Table 1: Internet Traffic Archive (Web Usage Data)

    Server          Location           No. of Requests    Time From
    Saskatchewan    Canada             2,408,625          00:00:00 June
    NASA            Florida            3,461,612          00:00:00 July
    Calgary         Alberta, Canada    726,739            October 24
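As a rough sketch, a log line in the format shown in Figure 1 (host, bracketed timestamp, quoted request, status code, byte count) could be parsed into separate fields before clustering as follows. The regular expression, the function name `parse_log_line`, the returned dictionary keys, and the treatment of a `-` byte count as zero are assumptions made for illustration; they are not part of the method described in this paper.

```python
import re

# Assumed layout, taken from the Figure 1 sample:
#   host [DD:HH:MM:SS] "METHOD /path PROTOCOL" status bytes
LOG_LINE = re.compile(
    r'^(?P<host>\S+)\s+\[(?P<time>[^\]]+)\]\s+"(?P<request>[^"]*)"\s+'
    r'(?P<status>\d{3})\s+(?P<bytes>\d+|-)$'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    # The quoted request normally holds method, URL and protocol, in that order.
    parts = match.group("request").split()
    size = match.group("bytes")
    return {
        "host": match.group("host"),
        "time": match.group("time"),
        "method": parts[0] if len(parts) > 0 else "",
        "url": parts[1] if len(parts) > 1 else "",
        "protocol": parts[2] if len(parts) > 2 else "",
        "status": int(match.group("status")),
        "bytes": 0 if size == "-" else int(size),  # assumption: "-" means no body
    }
```

Lines that do not match, such as a truncated record, yield `None` and can be skipped or counted separately during preprocessing.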
All the above logs have timestamps with 1-second resolution. The logs fully preserve the originating host and HTTP request, and the traces can be freely distributed. Each log is an ASCII file with one line per request, with the following columns:
1. host making the request: a hostname or an Internet address.
2. timestamp in the format "DAY MON DD HH:MM:SS YYYY".
3. request given in quotes.
4. HTTP reply code.
5. bytes in the reply.

Since various clustering algorithms result in different clusters, it is important to evaluate the results to assess their quality. In clustering, the procedure of evaluating the results is known as cluster validation and can be based on various measures called validity measures. The validity measures are divided into two categories depending on whether they have any reference to external knowledge. By external knowledge we refer to a pre-specified structure which reflects our intuition about the clustering structure of a data set. The measures that have no reference to external knowledge are called internal quality measures, and they are estimated in terms of quantities that involve only the data set. Dunn's index and the DB index are two internal quality measures that have a close relationship in that they both try to minimize the within-cluster scatter while maximizing the between-cluster separation in order to find compact and well-separated clusters.

5.1. Dunn Index

The Dunn index is defined by the following equation for a specific number of clusters nc:

$$D_{nc} = \min_{i=1,\dots,nc} \left\{ \min_{j=i+1,\dots,nc} \left( \frac{d(c_i, c_j)}{\max_{k=1,\dots,nc} \operatorname{diam}(c_k)} \right) \right\}$$

where d(ci, cj) is the dissimilarity function between two clusters ci and cj, defined as

$$d(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} d(x, y)$$

and diam(c) is the diameter of a cluster, which may be considered a measure of the dispersion of the clusters. The diameter of a cluster C can be defined as follows:

$$\operatorname{diam}(C) = \max_{x, y \in C} d(x, y)$$

It is clear that if the dataset contains compact and well-separated clusters, the distance between the clusters is expected to be large and the diameter of the clusters is expected to be small. Thus, based on the definition of Dunn's index, we may conclude that large values of the index indicate the presence of compact and well-separated clusters.

5.2. DB Index

Given that K is the number of clusters, Ci and Cj are the closest clusters according to average
distance d and diam is the diameter of a cluster, the DB index is defined as follows:

$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \left( \frac{\operatorname{diam}(C_i) + \operatorname{diam}(C_j)}{d(C_i, C_j)} \right)$$

It is clear from the above definition that DB is the average similarity between each cluster and its most similar one. It is desirable for the clusters to have the minimum possible similarity to each other; therefore we seek a clustering that minimizes DB.

Each access to a Web page is recorded in the access log of the Web server that hosts it. The entries of a Web log file consist of fields that follow a predefined format. The fields of the common log format are:

    S. NO   IP address         Access time              Request method   URL
    1       115.242.159.123    Apr 08, 2002 08:46 PM    GET              http://www.yaledailynews.com
    2       125.242.149.122    Apr 08, 2002 08:43 PM    POST             http://www.waterski.com
    3       234.222.111.152    Apr 08, 2002 08:40 PM    GET              http://www.sony.com

By adding the rating to the log file format, we can find out the worth of a site. Using this, the site developer is also encouraged to put effort into development. When web mining is performed periodically on the web data, the low-rated sites are kept separately; in this sense it is also a type of page ranking algorithm.

6. Conclusions And Future Work:

Web usage mining applies data mining techniques to discover usage patterns from Web data. In this paper we have proposed a new method for data logs that adds a rating field, which is helpful for web mining and also for users. In the first step, the initial cluster centers are selected based on a statistical mode-based calculation to allow the iterative algorithm to converge to a "better" local minimum. In the second step, we have proposed a novel method to improve the cluster quality using a Genetic Algorithm (GA) based refinement algorithm. The proposed change is to add the feedback field to the log format.

With this feedback we can separate out the unwanted sites, and for that we can develop an effective algorithm. Also, based on the time for which a user searches data within a single site over a long period, algorithms can automatically generate a rating for those blogs. Future work is
to develop an efficient algorithm for this.
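As a supplementary illustration of the validity measures of Section 5, the Dunn and DB indices can be computed directly from their definitions. The sketch below is hedged in several ways: it assumes Euclidean distance between points, takes the cluster diameter as the largest pairwise distance within a cluster, and reuses the minimum point-to-point distance d(ci, cj) for the DB index as well. The function names are hypothetical and this is not the evaluation code used for the experiments.

```python
import math
from itertools import combinations


def cluster_distance(a, b):
    """d(ci, cj): smallest distance between a point of a and a point of b."""
    return min(math.dist(x, y) for x in a for y in b)


def diameter(cluster):
    """diam(C): largest distance between two points of one cluster (0 for a singleton)."""
    return max((math.dist(x, y) for x, y in combinations(cluster, 2)), default=0.0)


def dunn_index(clusters):
    """Smallest inter-cluster distance divided by the largest cluster diameter."""
    # Raises ZeroDivisionError if every cluster is a single point.
    max_diam = max(diameter(c) for c in clusters)
    min_sep = min(cluster_distance(a, b) for a, b in combinations(clusters, 2))
    return min_sep / max_diam


def db_index(clusters):
    """Average, over clusters, of the worst (diam_i + diam_j) / d(Ci, Cj) ratio."""
    total = 0.0
    for i, ci in enumerate(clusters):
        total += max(
            (diameter(ci) + diameter(cj)) / cluster_distance(ci, cj)
            for j, cj in enumerate(clusters)
            if j != i
        )
    return total / len(clusters)
```

For two clusters `[(0, 0), (0, 1)]` and `[(3, 0), (3, 1)]`, each diameter is 1 and the inter-cluster distance is 3, so the Dunn index is 3.0 and the DB index is 2/3, matching the reading that a larger Dunn value and a smaller DB value indicate compact, well-separated clusters.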
7.References:                                    [20] B. Hay, K Vanhoof, and G. Wetsr
                                                Clustering navigation patterns on a Website
                                                using a sequence
[1] R. Agrawal and R. Srikant, “Fast            alignment method. In Proceedings of 17th
algorithms for mining association rules,”       International Joint Conference on Artificial
Proc. of the 20th                               Intelligence, Seattle,Washington, USA,
VLDB Conference, pp. 487- 499, Santiago,        August, 2001.
Chile, 1994.                                    Refinement of Web usage Data Clustering
 [6] I. V. Cadez, D. Heckerman, C. Meek, P.     from K-means with Genetic Algorithm 489
Smyth, and S. White. Model-based                 [26] T. Kanungo, D.M. Mount, N.
clustering and                                  Netanyahu, C. Piatko, R. Silverman, and
visualization of navigation patterns on a       A.Y. Wu, An efficient
Web site. Data Mining and Knowledge             k-means clustering algorithm: Analysis and
Discovery,                                      implementation. IEEE Trans. Pattern
7(4):399-424, 2003.                             Analysis and
[7] S. Chakrabarti. Mining the Web. Morgan      Machine Intelligence, 24 (7): 881-892, 2002.
Kaufmann, 2003.                                  [30] Z. Michalewicz, “Genetic Algorithms,
[8] Z. Chen, A.Wai-Chee Fu, and F. Chi-         Data Structures" Evolution Programs,
Hung Tong. Optimal algorithms for finding       Springer, New
user access                                     York, 1992.
sessions from very large Web logs. World         [34] O. Nasraoui, H. Frigui, A. Joshi, and
Wide Web: Internet and Information              R. Krishnapuram, “Mining Web Access
Systems,                                        Logs Using
6:259-279, 2003.                                Relational Competitive Fuzzy Clustering”,
[9] D. Cheng, B. Gersho, Y. Ramamurthi,         to be presented at the Eight International
and Y. Shoham, Fast Search Algorithms for       Fuzzy
Vector                                          Systems Association World Congress -
Quantization and Pattern Recognition.           IFSA 99, Taipei, August 99.
Proceeding of the IEEE International             [36] S. Oyanagi, K. Kubota, A. Nakase,
Conference on                                   Application of matrix clustering to web log
Acoustics, Speech and Signal Processing,        analysis and
1:1-9, 1984.                                    access prediction, in: WEBKDD2001—
 [12] N. Eiron and K. S. McCurley.              MiningWeb LogDataAcrossAll Customers
Untangling compound documents on                Touch
theWeb. In Proceedings of                       Points, Third InternationalWorkshop, 2001.
ACM Hypertext,, pages 85-94, 2003.               [39] C. Shahabe, A. M. Zarkesh, J. Abidi
                                                and V. Shah, “Knowledge discovery from
[15] J.L.R. Filho, P.C. Treleaven, C. Alippi,   user’s web-page
Genetic algorithm programming                   navigation,” Proc. Seventh IEEE Intl.
environments, IEEE                              Workshop on Research Issues in Data
Comput. 27:28-43,1994.                          Engineering
                                                (RIDE), 20-29, 1997.
WEBKDD 2001—Mining Web Log Data
Across All Customers Touch Points, Third
International Workshop, San Francisco, CA,
USA, August 26, 2001. Revised papers, vol.
2356
of Lecture Notes in Comp Sc, Springer,
113–144, 2002.
 [44] J. Srivastava, R. Cooley, M.
Deshpande, and P. Tan, Web Usage Mining:
Discovery and
Applications of Usage Patterns from Web
Data, in SIGKDD Explorations, 1(2):1-12,
2000.
 [46] Xu R., and Wunsch D., Survey of
clustering algorithms. IEEE Trans. Neural
Networks, 16 (3):
645-678, 2005.

By

  • 1. BY USING FEEDBACK AND K-MEAN CLUSTERING FOR REFINE WEB DATA Abstract 1. Introduction Now a day’s more web sites The explosive growth of are developed by everyone. Among information sources available on the them user cannot get accurate data World Wide Web, it has become that user required by searching on increasingly necessary for users to web. In basically web mining can be utilize automated tools in find the done by some page ranking desired information resources, and to algorithms are many more. In this track and analyze their usage paper , user going to refine the web patterns. These factors give rise to pages by giving feed back or any the necessity of creating server side rating by manually or by and client side intelligent systems automatically. K-mean clustering that can effectively mine for algorithm is basic algorithm used day knowledge. Web mining can be to day life. We have proposed genetic broadly defined as the discovery and algorithm to improve cluster quality analysis of useful information from and also accurate clusters. By also the World Wide Web. This describes apply the weblogs to our paper to the automatic search of information more refine. Web mining using resources available online, i.e. Web feedback is eliminating the unwanted content mining, and the discovery of sites in web and also it help for user access patterns from Web improving the user data in developing servers, i.e., Web usage mining. sites. There are roughly three KEY WORDS: Web mining, knowledge discovery domains that clustering ,k-mean, web logs. pertain to web mining: Web Content Mining, Web Structure Mining, and Web Usage Mining. Web content mining is the process of extracting knowledge from the content of documents or their descriptions. Web document text mining, resource
  • 2. discovery based on concepts indexing about activities performed by a user or agent based technology may also from the moment the user enters a fall in this category. Web structure Web site to the mining is the process of inferring moment the same user leaves it. The knowledge from the Worldwide Web records of users’ actions within a organization and links between Web site are stored in a log file. Each references and referents in the Web. record in the log file contains the Finally, web usage mining, also client’s IP address, the date and time known as Web Log Mining, is the the request is received, the process of extracting interesting requested object and some patterns in web access logs. additional information -such as protocol of request, size of the object etc. Figure 1 presents a sample of a Web access log file from a Web server. Figure 1: A sample of Web Server Log File 141.243.1.172 [29:23:53:25] "GET /Software.html HTTP/1.0" 200 1497 query2.lycos.cs.cmu.edu [29:23:53:36] "GET /Consumer.html HTTP/1.0" 200 1325 tanuki.twics.com [29:23:53:53] "GET /News.html HTTP/1.0" 200 1014 wpbfl2-45.gate.net [29:23:54:15] "GET / HTTP/1.0" 200 4889 wpbfl2-45.gate.net [29:23:54:16] "GET /icons/circle_logo_small.gif HTTP/1.0" 200 2624 wpbfl2-45.gate.net [29:23:54:18] "GET We can broadly categorize Web data clustering into (i) users’ sessions- based and (ii) link-based. The former The standard K-Means uses the Web log data and tries to algorithm was used to cluster user’s group together a set of users’ traversal paths . However, it is not navigation sessions having similar clear how the similarity measure was characteristics. In this framework, devised and whether the clusters are Web-log data provide information meaningful. Associations and
  • 3. sequential patterns between web neighbor queries in the algorithm can transactions are discovered based on accelerate it. In addition, the number Apriori algorithm . A good survey on of distance calculations increases clustering algorithms can be found . exponentially with the increase of the The k-means algorithm is one of the dimensionality of the data . most widely used clustering algorithms. The algorithm partitions Many algorithms have been the data points (objects) into k proposed to accelerate the k-means. groups (clusters), so as to minimize The use of kd-trees is suggested to the sum of the squared) distances accelerate the k-means. However, between the data points and the backtracking is required, a case in center (mean) of the clusters. which the computation complexity is To apply the k-means algorithm: increased . Kd-trees are not efficient for higher dimensions. Furthermore, • Choose k data points to it is not guaranteed that an exact initialize the clusters match of the nearest neighbor can be • For each data point, find the found unless some extra search is nearest cluster center that is closest done as discussed . Elkan suggests and the use of triangle inequality to Assign that data point to the accelerate the k-means. It is corresponding cluster suggested to use R-Trees. • Update the cluster centers in Nevertheless, R-Trees may not be each cluster using the mean of the appropriate for higher dimensional data points which are problems.The Partial Distance (PD) assigned to that cluster algorithm has been proposed. The • Repeat steps 2 and 3 until algorithm allows early termination of there are not more changes in the the distance calculation by values of the Means. introducing a premature exit condition in the search process. In spite of its simplicity, the k- means algorithm involves a very large As seen in the literature, the number of nearest neighbor queries. 
researchers contributed only to The high time complexity of the k- accelerate the algorithm; there is no means algorithm makes it impractical contribution in cluster refinement. In for use in the case of having a large this study, we propose a new number of points in the data set. algorithm to improve the k-means Reducing the large number of nearest clustering in web usage data mining.
The proposed algorithm consists of two steps. In the first step, to avoid local minima, we present a simple and efficient method to select initial centroids based on the mode value of the data vector, and the k-means algorithm is applied to cluster the data vectors. In the second step, a Genetic Algorithm (GA) is applied to refine the clusters and improve the quality of the clusters of users' sessions.

The paper is organized as follows: the following section defines the web access logs. Section 3 presents the standard k-means algorithm. Section 4 presents the proposed cluster refinement algorithm with Genetic Algorithm (GA) to improve the users' session clusters. The experiments and results are then presented, and the work is concluded.

2. Web Access Logs:

Basic access logs

In general, web server logs consist of these records: (i) User's IP address, (ii) Access time, (iii) Request method ("GET", "POST", …, etc.), (iv) URL of the page accessed, (v) Protocol (typically HTTP/1.0), (vi) Number of bytes.

Modified access logs

The modified web server logs consist of these records: (i) User's IP address, (ii) Access time, (iii) Request method ("GET", "POST", …, etc.), (iv) URL of the page accessed, (v) Protocol (typically HTTP/1.0), (vi) Number of bytes, (vii) Rating or feedback.

The last field holds the rating for the site, indicating whether or not the site is useful for the user's requirements. This makes it helpful for the refinement of web data. This field can be filled in automatically by system algorithms.

Rating sites typically show a series of images (or other content) in random fashion, or chosen by computer algorithm, rather than allowing users to choose. They then ask users for a rating or assessment, which is generally done quickly and without great deliberation. Users score items on a scale of 1 to 10, or as yes or no. Others, such as BabeVsBabe.com, ask users to choose between two pictures. Typically, the site gives instant feedback in terms of the item's running score, or the percentage of other users who agree with the assessment. They sometimes offer
aggregate statistics or "best" and "worst" lists. Most allow users to submit their own image, sample, or other relevant content for others to rate. Some require the submission as a condition of membership.

3. Standard K-Means Algorithm

One of the most popular clustering techniques is the k-means clustering algorithm. Starting from a random partitioning, the algorithm repeatedly (i) computes the current cluster centers (i.e. the average vector of each cluster in data space) and (ii) reassigns each data item to the cluster whose centre is closest to it. It terminates when no more reassignments take place. By this means, the intra-cluster variance, that is, the sum of squares of the differences between data items and their associated cluster centers, is locally minimized. The strength of k-means is its runtime, which is linear in the number of data elements, and its ease of implementation. However, the algorithm tends to get stuck in suboptimal solutions (dependent on the initial partitioning and the data ordering) and it works well only for spherically shaped clusters. It requires the number of clusters to be provided or to be determined (semi-)automatically. In our experiments, we run k-means using the correct cluster number.

1. Choose a number of clusters K.
2. Initialize cluster centers n1,… nk.
   a. Could pick k data points and set cluster centers to these points
   b. Or could randomly assign points to clusters and take means of clusters
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute cluster centers (mean of data points in cluster)
5. Repeat steps 3 and 4; stop when there are no new re-assignments.

4. Genetic Algorithm

The initial cluster centers are normally chosen either sequentially or randomly, as given in the standard algorithm. The quality of the final clusters depends on these initial seeds, which may lead to a local minimum; this is one of the disadvantages of k-means clustering. To avoid this, in our method, we select the modes of the data vector as initial cluster centers. Based on the number of clusters, the modes are selected one
after another. Initially the first mode value is selected as the center for the first cluster, and the next most frequently occurring value (the next mode value) is assigned as the center for the next cluster.

A genetic algorithm (GA) is a randomized search and optimization technique guided by the principles of evolution and natural genetics, having a large amount of implicit parallelism. GAs perform search in complex, large and multimodal landscapes, and provide near-optimal solutions for the objective or fitness function of an optimization problem.

In this algorithm, the search space is encoded in the form of strings (called chromosomes). The basic reason for our refinement is that, in any clustering algorithm, the obtained clusters will never give us 100% quality. There will be some errors, known as misclustering. That is, a data item can be wrongly clustered. These kinds of errors can be avoided by using our refinement algorithm. GAs have applications in fields as diverse as VLSI design, image processing, neural networks, machine learning, job shop scheduling, etc.

The clusters obtained from the improved k-means clustering are considered as input to our refinement algorithm. Initially a random point is selected from each cluster; with these a chromosome is built. In this way an initial population of 10 chromosomes is built. For each chromosome the entropy is calculated as the fitness value, and the global minimum is extracted. With this initial population, the genetic operators such as reproduction, crossover and mutation are applied to produce a new population. While applying the crossover operator, the cluster points get shuffled, meaning that a point can move from one cluster to another. From this new population, the local minimum fitness value is calculated and compared with the global minimum. If the local minimum is less than the global minimum, then the global minimum is assigned the local minimum, and the next iteration is continued with the new population. Otherwise, the next iteration is continued with the same old population. This process is repeated for N iterations.

In the following section, it is shown that our refinement algorithm improves the cluster quality. The algorithm is given as:

1. Choose a number of clusters k
2. Initialize cluster centers n1,… nk based on mode
3. For each data point, compute the cluster center it is closest to (using some distance measure) and assign the data point to this cluster.
4. Re-compute cluster centers (mean of data points in cluster)
5. Stop when there are no new re-assignments.
6. GA based refinement
   a. Construct the initial population (p1)
   b. Calculate the global minimum (Gmin)
   c. For i = 1 to N do
      i. Perform reproduction
      ii. Apply the crossover operator between each parent.
      iii. Perform mutation and get the new population (p2)
      iv. Calculate the local minimum (Lmin).
      v. If Lmin < Gmin then
         a. Gmin = Lmin;
         b. p1 = p2;
   d. Repeat

5. Experiments

We have generated clusters using both algorithms for several different logs obtained from the Internet Traffic Archive (http://ita.ee.lbl.gov/). The following six web access log data sets, which are collected from various web servers, were used to test our proposed method.

• EPA-HTTP - a day of HTTP logs from a busy WWW server.
• SDSC-HTTP - a day of HTTP logs from a busy WWW server.
• Calgary-HTTP - a year of HTTP logs from a CS departmental WWW server.
• ClarkNet-HTTP - two weeks of HTTP logs from a busy Internet service provider WWW server.
• NASA-HTTP - two months of HTTP logs from a busy WWW server.
• Saskatchewan-HTTP - seven months of HTTP logs from a University WWW server.

The following table gives a brief description of each web access log set.

Table 1: Internet Traffic Archive (Web Usage Data)

Server         Location          No. of Requests   Time From
Saskatchewan   Canada            2,408,625         00:00:00 June
NASA           Florida           3,461,612         00:00:00 July
Calgary        Alberta, Canada   726,739           October 24
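Steps 1 to 6 of the algorithm listed in Section 4 can be sketched in Python as follows. This is only an illustrative sketch under several assumptions that are not in the paper: the function names (`mode_init`, `kmeans`, `ga_refine`, `fitness`) are hypothetical, data vectors are tuples of numbers, the fitness is the total squared distance of each point to its nearest gene (a stand-in, since the entropy formula is not given), and reproduction, crossover and mutation are implemented as binary tournament selection, single-point crossover and random gene replacement.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mode_init(points, k):
    """Steps 1-2: take the k most frequent data vectors as initial centers."""
    return [p for p, _ in Counter(points).most_common(k)]


def kmeans(points, k, max_iter=100):
    """Steps 3-5: assign, re-compute means, stop when assignments are stable."""
    centers = mode_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels


def fitness(chrom, points):
    """Assumed fitness (lower is better): squared distance to the nearest gene."""
    return sum(min(sq_dist(p, g) for g in chrom) for p in points)


def ga_refine(points, labels, k, pop_size=10, n_iter=50, p_mut=0.1, seed=0):
    """Step 6: GA refinement; pop_size is assumed to be even."""
    rng = random.Random(seed)
    clusters = [[p for p, lab in zip(points, labels) if lab == c] for c in range(k)]
    clusters = [c for c in clusters if c]
    # 6a: each chromosome holds one random point from every cluster
    pop = [[rng.choice(c) for c in clusters] for _ in range(pop_size)]
    best = min(pop, key=lambda ch: fitness(ch, points))
    gmin = fitness(best, points)  # 6b: global minimum
    for _ in range(n_iter):  # 6c
        # reproduction: binary tournament selection (assumption)
        parents = [min(rng.sample(pop, 2), key=lambda ch: fitness(ch, points))
                   for _ in range(pop_size)]
        new_pop = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = rng.randrange(1, len(a)) if len(a) > 1 else 0
            new_pop += [a[:cut] + b[cut:], b[:cut] + a[cut:]]  # crossover
        for ch in new_pop:  # mutation: swap a gene for a random data point
            for i in range(len(ch)):
                if rng.random() < p_mut:
                    ch[i] = rng.choice(points)
        cand = min(new_pop, key=lambda ch: fitness(ch, points))
        lmin = fitness(cand, points)  # local minimum of the new population
        if lmin < gmin:  # accept the new population only if it improves
            gmin, best, pop = lmin, list(cand), new_pop
    return best, gmin
```

A call such as `centers, labels = kmeans(points, 2)` followed by `best, gmin = ga_refine(points, labels, 2)` runs both phases; the new population replaces the old one only when its best fitness improves on the global minimum, as in step 6.c.v.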
All the above logs are taken with timestamps that have 1 second resolution. The logs fully preserve the originating host and HTTP request, and these traces can be freely distributed. Each log is an ASCII file with one line per request, with the following columns:

1. host making the request. A hostname or the Internet address.
2. timestamp in the format "DAY MON DD HH:MM:SS YYYY".
3. request given in quotes.
4. HTTP reply code.
5. bytes in the reply.

Since various clustering algorithms result in different clusters, it is important to perform an evaluation of the results to assess their quality. In clustering, the procedure of evaluating the results is known as cluster validation and can be based on various measures called validity measures. The validity measures are divided into two categories depending on whether they have any reference to external knowledge. By external knowledge we refer to a pre-specified structure which reflects our intuition about the clustering structure of a data set. The measures that have no reference to external knowledge are called internal quality measures, and they are estimated in terms of quantities that involve the data set. Dunn's index and the DB index are two internal quality measures that have a close relationship, in that they both try to minimize the within-cluster scatter while maximizing the between-cluster separation in order to find compact and well-separated clusters.

The Dunn Index

The index is defined by the following equation for a specific number of clusters $n_c$:

$$D_{n_c} = \min_{i=1,\dots,n_c} \left\{ \min_{j=i+1,\dots,n_c} \left( \frac{d(c_i, c_j)}{\max_{k=1,\dots,n_c} \operatorname{diam}(c_k)} \right) \right\}$$

where $d(c_i, c_j)$ is the dissimilarity function between two clusters $c_i$ and $c_j$, defined as

$$d(c_i, c_j) = \min_{x \in c_i,\, y \in c_j} d(x, y)$$

and $\operatorname{diam}(c)$ is the diameter of a cluster, which may be considered as a measure of dispersion of the clusters. The diameter of a cluster $C$ can be defined as follows:

$$\operatorname{diam}(C) = \max_{x, y \in C} d(x, y)$$

It is clear that if the data set contains compact and well-separated clusters, the distance between the clusters is expected to be large and the diameter of the clusters is expected to be small. Thus, based on the definition of Dunn's index, we may conclude that large values of the index indicate the presence of compact and well-separated clusters.

5.2. DB Index

Given that K is the number of clusters, Ci and Cj are the closest clusters according to average
distance d and diam is the diameter of a cluster, the DB index is defined as follows:

$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \left\{ \frac{\operatorname{diam}(C_i) + \operatorname{diam}(C_j)}{d(C_i, C_j)} \right\}$$

It is clear from the above definition that DB is the average similarity between each cluster and its most similar one. It is desirable for the clusters to have the minimum possible similarity to each other; therefore we seek the clustering that minimizes DB.

Each access to a Web page is recorded in the access log of the Web server that hosts it. The entries of a Web log file consist of fields that follow a predefined format. The fields of the common log format are:

S.NO   IP address        Access time            Request method   URL
1      115.242.159.123   Apr 08, 2002 08:46 PM  GET              http://www.yaledailynews.com
2      125.242.149.122   Apr 08, 2002 08:43 PM  POST             http://www.waterski.com
3      234.222.111.152   Apr 08, 2002 08:40 PM  GET              http://www.sony.com

By adding the rating to the log file format, we can find out the worth of the site. Using this, the site developer can also put effort into developing the site. By periodically performing web mining on the web data, the low-rated sites are kept separately; this is also a type of page ranking algorithm.

6. Conclusions And Future Work:

Web usage mining applies data mining techniques to discover usage patterns from Web data. In this paper we have proposed a new method for data logs by adding a rating field, which will be helpful for web mining and also for users. In the first step, the initial cluster centers are selected using a statistical mode-based calculation to allow the iterative algorithm to converge to a "better" local minimum. In the second step, we have proposed a novel method to improve cluster quality using a Genetic Algorithm (GA) based refinement algorithm. The proposal is to add the feedback field to the log format.

With this feedback we can separate out the unwanted sites, and an effective algorithm can be developed for that purpose. Also, when a user searches the data in a single site for a long period of time, an algorithm could automatically generate a rating for those blogs. Future work is
to develop an efficient algorithm for this.
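As a companion to the Dunn and DB index definitions in Section 5, the following Python sketch computes both measures for a clustering given as a list of clusters, each a list of numeric tuples. The function names are hypothetical, Euclidean distance is assumed for d(x, y), and for the DB index the inter-cluster distance d(Ci, Cj) is taken to be the average pairwise distance between the two clusters, which is one reading of "average distance" in the text. Clusters are assumed to be non-empty, and at least one cluster must contain two distinct points so that the largest diameter is non-zero.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance, Python 3.8+


def diam(cluster):
    """Diameter: largest pairwise distance inside one cluster."""
    return max((dist(x, y) for x, y in combinations(cluster, 2)), default=0.0)


def d_min(ci, cj):
    """Dunn dissimilarity: smallest distance between points of two clusters."""
    return min(dist(x, y) for x, y in product(ci, cj))


def d_avg(ci, cj):
    """Assumed DB distance: average pairwise distance between two clusters."""
    return sum(dist(x, y) for x, y in product(ci, cj)) / (len(ci) * len(cj))


def dunn_index(clusters):
    """Larger values indicate compact, well-separated clusters."""
    max_diam = max(diam(c) for c in clusters)
    return min(d_min(ci, cj) for ci, cj in combinations(clusters, 2)) / max_diam


def db_index(clusters):
    """Smaller values indicate clusters that are less similar to each other."""
    k = len(clusters)
    total = 0.0
    for i, ci in enumerate(clusters):
        total += max((diam(ci) + diam(cj)) / d_avg(ci, cj)
                     for j, cj in enumerate(clusters) if j != i)
    return total / k
```

For two clusters `[(0, 0), (0, 1)]` and `[(3, 0), (3, 1)]`, each diameter is 1 and the closest pair of points is 3 apart, so `dunn_index` returns 3.0.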