This document provides an overview of a major seminar on knowledge discovery from web logs. It discusses how analyzing vast amounts of web site traversal data stored in web logs can reveal useful knowledge about user behavior that can be applied to improve web service performance. Specific techniques covered include mining web logs to build path profiles that predict future page visits, using these predictions to prefetch web documents for faster loading, and clustering web pages to create more intuitive user interfaces. The document lists several applications of web log mining and its advantages.
1. Major Seminar
On
Knowledge Discovery from Web Logs
Guided By: Saurabh Anand, Lecturer, Department Of IT
Presented By: Avtar kishore Gaur (IT/09/53), VIII Sem, IT
Poornima College Of Engineering
Sitapura, Jaipur
2. Introduction
• A vast amount of Web site traversal information is
available in the form of Web logs.
• By analyzing these logs, it is possible to discover various
kinds of knowledge, which can be applied to improve the
performance of Web services.
• It is possible to learn the behavior of the Web users by
analyzing these logs.
3. Introduction
• Knowledge that can be immediately applied to the
operation of the Web site is called actionable
knowledge.
• Mining of such knowledge is known as Knowledge
Discovery from Web Logs.
4. How big is the Web
• More than 4 billion websites are on the Internet (according
to alexa.com).
• At least 7.92 billion pages (Thursday, 23 February 2012,
according to worldwidewebsize.com).
5. History
• Previous approaches were aimed only at mining Web-log
knowledge for human consumption.
• These days, mining actionable knowledge from Web logs is
being used to improve the performance of Web services.
6. Fields in Web Log File
• Reference website: www.hdwally.com. Web server: Apache.
1. 66.249.71.6 - - [23/Feb/2012:06:23:46 -0600] "GET
/robots.txt HTTP/1.1" 500 7370 "-" "Mozilla/5.0
(compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"
2. 180.76.5.92 - - [23/Feb/2012:06:11:04 -0600] "GET /
HTTP/1.1" 500 7370 "-" "Mozilla/5.0 (compatible;
Baiduspider/2.0;
+http://www.baidu.com/search/spider.html)"
• IP address: 66.249.71.6 and 180.76.5.92
• Identity and user name: "- -" in both entries (not recorded)
• Timestamp: [23/Feb/2012:06:23:46 -0600] and
[23/Feb/2012:06:11:04 -0600] (time the request reached the
web server)
7. Fields in Web Log File
• Access request: "GET /robots.txt HTTP/1.1" and "GET /
HTTP/1.1"
• Result status code: 500 and 500 (Internal Server Error)
• Bytes transferred: 7370 and 7370
• Referrer URL: "-" and "-" (no referrer was sent)
• User agent: "Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)" and "Mozilla/5.0
(compatible; Baiduspider/2.0;
+http://www.baidu.com/search/spider.html)"
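As an illustration, the fields listed above can be pulled out of one log entry with a regular expression. This is a minimal sketch that assumes the Apache "combined" log format; the names LOG_PATTERN and parse_line are made up for this example and are not from the seminar.

```python
import re
from datetime import datetime

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the text fields that have a natural Python type.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# First sample entry from the slide, joined onto one line.
sample = ('66.249.71.6 - - [23/Feb/2012:06:23:46 -0600] '
          '"GET /robots.txt HTTP/1.1" 500 7370 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"')
entry = parse_line(sample)
print(entry["host"], entry["status"], entry["size"], entry["referrer"])
print(entry["agent"])
```

Lines that do not follow the assumed format return None, so a caller can skip or count them instead of failing.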
9. Mining Web Logs for Path Profiles
• Data Cleaning on Web Log Data
• Mining Web Logs for Path Profiles
• Web Object Prediction
• Learning to Prefetch Web Documents
10. Data Cleaning on Web Log Data
• Break apart each user's long sequence of visits into user
sessions.
• Identify each user by an individual IP address.
• Thus, data cleaning here means separating the visiting
sequence of pages into visiting sessions.
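The session-splitting step can be sketched as follows. The slides do not say how a session boundary is detected, so this sketch assumes a new session starts after 30 minutes of inactivity from the same IP address. The cutoff, the function name split_sessions, and the sample requests are all assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff, not from the slides

def split_sessions(requests, gap=SESSION_GAP):
    """Group (ip, timestamp, url) records into per-IP lists of sessions."""
    by_ip = defaultdict(list)
    for ip, ts, url in requests:
        by_ip[ip].append((ts, url))

    sessions = {}
    for ip, visits in by_ip.items():
        visits.sort(key=lambda visit: visit[0])  # order each user's visits by time
        user_sessions = []
        last_ts = None
        for ts, url in visits:
            # Start a new session on the first visit or after a long pause.
            if last_ts is None or ts - last_ts > gap:
                user_sessions.append([])
            user_sessions[-1].append(url)
            last_ts = ts
        sessions[ip] = user_sessions
    return sessions

# Hypothetical requests (documentation-range IP addresses, made-up URLs).
t0 = datetime(2012, 2, 23, 6, 0, 0)
requests = [
    ("192.0.2.1", t0, "/"),
    ("192.0.2.1", t0 + timedelta(minutes=2), "/gallery"),
    ("192.0.2.2", t0 + timedelta(minutes=3), "/"),
    ("192.0.2.1", t0 + timedelta(hours=2), "/"),
]
sessions = split_sessions(requests)
print(sessions["192.0.2.1"])
```

Identifying a user by IP address alone is the simplification used on the slide; visitors behind a shared proxy would be merged into one user under this scheme.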
11. Web Log Mining for Prefetching
• We now have separate visiting sessions.
• We can develop path profiles from these sessions, because a
user visiting a sequence of Web pages leaves a trail of
the page URLs in a Web log.
• A path profile consists of frequent subsequences taken from
the frequently occurring paths.
• A path profile helps us predict the next pages that are
most likely to be visited.
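One simple way to build such a path profile is to count contiguous page subsequences across all sessions and keep the ones that occur often. This is a sketch under assumptions: the length limit, the minimum count, the function name build_path_profile, and the sample sessions are chosen for illustration and are not taken from the seminar.

```python
from collections import Counter

def build_path_profile(sessions, max_len=3, min_count=2):
    """Count contiguous subsequences of 2..max_len pages and keep frequent ones."""
    counts = Counter()
    for session in sessions:
        for n in range(2, max_len + 1):
            for i in range(len(session) - n + 1):
                counts[tuple(session[i:i + n])] += 1
    # Keep only paths seen at least min_count times (assumed threshold).
    return {path: count for path, count in counts.items() if count >= min_count}

# Hypothetical sessions with made-up URLs.
sessions = [
    ["/", "/gallery", "/gallery/cars"],
    ["/", "/gallery", "/gallery/nature"],
    ["/about", "/", "/gallery", "/gallery/cars"],
]
profile = build_path_profile(sessions)
for path, count in sorted(profile.items(), key=lambda item: -item[1]):
    print(count, " -> ".join(path))
```

With these sample sessions the path from "/" to "/gallery" is the most frequent, which is the kind of regularity a predictor can exploit.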
12. Web Object Prediction
• It is possible to train a path-based model that predicts
future URLs from a sequence of current URL accesses.
• This can be done on a per-user basis or on a per-server
basis.
• The former requires that user sessions be recognized
and separated by a filtering system; the latter takes the
simplistic view that all accesses on a server form a single
long thread.
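A path-based predictor of this kind can be sketched as a table that maps the last one or two URLs to counts of the URL that followed them. The model below is an assumption made for illustration (a small count-based model that falls back to a shorter context), not the specific model used in the cited papers. The names train and predict_next and the sample sessions are hypothetical.

```python
from collections import Counter, defaultdict

def train(sessions, max_context=2):
    """Map each context (last 1..max_context URLs) to counts of the next URL."""
    model = defaultdict(Counter)
    for session in sessions:
        for i in range(1, len(session)):
            for k in range(1, max_context + 1):
                if i - k < 0:
                    break
                context = tuple(session[i - k:i])
                model[context][session[i]] += 1
    return model

def predict_next(model, recent, max_context=2):
    """Predict the next URL, trying the longest matching context first."""
    for k in range(min(max_context, len(recent)), 0, -1):
        context = tuple(recent[-k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None  # nothing in the training data matches

# Per-user training: one list per session (made-up URLs).
sessions = [
    ["/", "/gallery", "/gallery/cars"],
    ["/", "/gallery", "/gallery/nature"],
    ["/about", "/", "/gallery", "/gallery/cars"],
]
model = train(sessions)
print(predict_next(model, ["/", "/gallery"]))
print(predict_next(model, ["/unknown", "/"]))
```

For the per-server view, the same train function could be given one long list containing every access on the server in time order, at the cost of mixing different users' paths.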
13. Learning to Prefetch Web Documents
• The original cache memory is partitioned into two parts: a
cache-buffer and a prefetching-buffer.
• A prefetching agent (a script) keeps pre-loading the
prefetching-buffer with documents predicted to be accessed
next.
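The two-buffer arrangement can be sketched as a small class. The slides do not give a replacement policy, so this sketch assumes each buffer evicts its oldest entry when full (least recently used for the cache-buffer). The class name PrefetchingCache, the fetch callback, and the sample URLs are hypothetical.

```python
from collections import OrderedDict

class PrefetchingCache:
    """Cache split into a cache-buffer and a prefetching-buffer (assumed LRU)."""

    def __init__(self, cache_size, prefetch_size, fetch):
        self.cache = OrderedDict()       # documents that were actually requested
        self.prefetched = OrderedDict()  # documents predicted to be requested
        self.cache_size = cache_size
        self.prefetch_size = prefetch_size
        self.fetch = fetch               # callable: url -> document
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _put(buffer, limit, url, doc):
        buffer[url] = doc
        buffer.move_to_end(url)
        while len(buffer) > limit:
            buffer.popitem(last=False)   # evict the oldest entry

    def prefetch(self, url):
        """Called by the prefetching agent for a predicted URL."""
        if url not in self.cache and url not in self.prefetched:
            self._put(self.prefetched, self.prefetch_size, url, self.fetch(url))

    def get(self, url):
        """Serve a user request from either buffer, fetching on a miss."""
        if url in self.cache:
            self.hits += 1
            self.cache.move_to_end(url)
            return self.cache[url]
        if url in self.prefetched:
            self.hits += 1
            doc = self.prefetched.pop(url)  # promote into the cache-buffer
        else:
            self.misses += 1
            doc = self.fetch(url)
        self._put(self.cache, self.cache_size, url, doc)
        return doc

fetched = []
def fake_fetch(url):
    """Stand-in for a real download; records which URLs were fetched."""
    fetched.append(url)
    return "<html>" + url + "</html>"

cache = PrefetchingCache(cache_size=2, prefetch_size=1, fetch=fake_fetch)
cache.get("/")              # miss: fetched on demand
cache.prefetch("/gallery")  # the agent pre-loads the predicted page
cache.get("/gallery")       # hit: served from the prefetching-buffer
print(cache.hits, cache.misses)
```

The benefit shows up as a request that would have been a miss becoming a hit, because the document was already loaded before the user asked for it.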
14. Web Page Clustering for Intelligent
User Interfaces
• Web logs can be used to build server-side customization
and transformation that make a website more convenient for
users to visit and to find what they are looking for.
• There are path prediction algorithms that guess where the
user wants to go next in a browsing session, such as
WebWatcher and the PageGather algorithm.
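To illustrate the clustering idea, pages that are often visited in the same session can be grouped together, and each group could then be offered as a set of related links. The sketch below is a simplified stand-in, not the PageGather algorithm itself: it links two pages when they co-occur in at least a chosen number of sessions and takes connected groups as clusters. The threshold, the name cluster_pages, and the sample sessions are assumptions.

```python
from collections import Counter
from itertools import combinations

def cluster_pages(sessions, min_cooccurrence=2):
    """Group pages that co-occur in at least min_cooccurrence sessions."""
    pair_counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            pair_counts[(a, b)] += 1

    # Build an undirected graph that links frequently co-visited pages.
    neighbors = {}
    for (a, b), count in pair_counts.items():
        if count >= min_cooccurrence:
            neighbors.setdefault(a, set()).add(b)
            neighbors.setdefault(b, set()).add(a)

    # Each connected component of the graph is one cluster.
    clusters, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(neighbors[page] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Hypothetical sessions with made-up URLs.
sessions = [
    ["/cars", "/cars/bmw", "/cars/audi"],
    ["/cars/bmw", "/cars/audi"],
    ["/nature", "/nature/forest"],
    ["/nature/forest", "/nature", "/cars"],
]
clusters = cluster_pages(sessions)
print(clusters)
```

Pages that never reach the co-occurrence threshold with any other page, such as "/cars" in this sample, are left out of every cluster.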
16. Advantages
• It is easy to implement.
• Companies can build better customer relationships
by giving customers exactly what they need.
• Personalized search engines can be created, which
understand a person's search queries in a personal way by
analyzing and profiling the user's search behavior.
• Caching and prefetching of Web objects can be improved.
• The mined knowledge can be used to build better, adaptive
user interfaces.
• Web query log knowledge can be applied to improve Web
search in a search engine application.
17. Reference
• Weblogs from www.hdwally.com and
www.hdwallpaper4u.com .
• www.jafsoft.com/searchengines/log_sample.html
• Research paper on Knowledge Discovery From Weblogs by
S Chandra and Dr B Kalpana.
• Research paper on Mining Web Logs for Actionable
Knowledge by Qiang Yang, Charles X. Ling and Jianfeng Gao.
• http://www.galeas.de/webmining.html