Major Seminar
                             On
        Knowledge Discovery from Web Logs




Guided By:                                       Presented By:
Saurabh Anand                                    Avtar kishore Gaur
Lecturer                                         (IT/09/53)
Department Of IT                                 VIII Sem, IT

                   Poornima College Of Engineering
                          Sitapura,Jaipur
Introduction
• A vast amount of Web site traversal information is available
  in the form of Web logs.
• By analyzing these logs, it is possible to discover various
  kinds of knowledge, which can be applied to improve the
  performance of Web services.
• It is possible to learn the behavior of the Web users by
  analyzing these logs.
Introduction
• A particular kind of knowledge, which can be immediately
  applied to the operation of the Web site, is called
  actionable knowledge.
• Mining of such knowledge is known as Knowledge
  Discovery from Web Logs.
How big is the Web
• More than 4 billion websites are on the Internet (according to
  alexa.com).

• At least 7.92 billion pages (Thursday, 23 February 2012,
  according to worldwidewebsize.com).
History
• Previous approaches aimed only to mine Web-log
  knowledge for human consumption.
• These days, actionable knowledge mined from Web logs is
  used to improve the performance of Web services.
Fields in Web Log File
• Reference Web site: www.hdwally.com; Web server: Apache
         1. 66.249.71.6 - - [23/Feb/2012:06:23:46 -0600] "GET
           /robots.txt HTTP/1.1" 500 7370 "-" "Mozilla/5.0
           (compatible; Googlebot/2.1;
           +http://www.google.com/bot.html)"
         2. 180.76.5.92 - - [23/Feb/2012:06:11:04 -0600] "GET /
           HTTP/1.1" 500 7370 "-" "Mozilla/5.0 (compatible;
           Baiduspider/2.0;
           +http://www.baidu.com/search/spider.html)"
• IP address: 66.249.71.6 and 180.76.5.92
• Identity and user name: "- -" in both entries (neither is
  recorded)
• Timestamp: [23/Feb/2012:06:23:46 -0600] and
  [23/Feb/2012:06:11:04 -0600] (time the Web server received
  the request)
Fields in Web Log File
• Access request: "GET /robots.txt HTTP/1.1" and "GET /
  HTTP/1.1"
• Result status code: 500 and 500 (Internal Server Error)
• Bytes transferred: 7370 and 7370
• Referrer URL: "-" in both entries (no referrer was sent)
• User agent: "Mozilla/5.0 (compatible; Googlebot/2.1;
  +http://www.google.com/bot.html)" and "Mozilla/5.0
  (compatible; Baiduspider/2.0;
  +http://www.baidu.com/search/spider.html)"
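The field breakdown above can be sketched in code. This is a minimal sketch that assumes the server writes the Apache "combined" log format; the names COMBINED and parse_line are illustrative and do not come from the slides.

```python
import re
from datetime import datetime

# Assumption: the log uses the Apache "combined" format
# (host, identity, user, time, request, status, bytes, referrer, user agent).
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Split one log line into named fields; return None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # Apache writes "-" when no bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# First sample entry from the slide, joined onto one line.
sample = (
    '66.249.71.6 - - [23/Feb/2012:06:23:46 -0600] "GET /robots.txt HTTP/1.1" '
    '500 7370 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"'
)
record = parse_line(sample)
print(record["host"], record["status"], record["size"], record["agent"])
```

Note that the whole quoted string at the end of the line is the user agent, and the quoted "-" before it is the referrer.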
Example Of a Web Log File
• fcrawler.looksmart.com - - [26/Apr/2000:00:00:12 -0400]
  "GET /contacts.html HTTP/1.0" 200 4595 "-"
  "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
• fcrawler.looksmart.com - - [26/Apr/2000:00:17:19 -0400]
  "GET /news/news.html HTTP/1.0" 200 16716 "-"
  "FAST-WebCrawler/2.1-pre2 (ashen@looksmart.net)"
• 123.123.123.123 - - [26/Apr/2000:00:23:48 -0400] "GET
  /pics/wpaper.gif HTTP/1.0" 200 6248
  "http://www.jafsoft.com/asctortf/" "Mozilla/4.05
  (Macintosh; I; PPC )"
Mining Web Logs for Path Profiles
•   Data Cleaning on Web Log Data
•   Mining Web Logs for Path Profiles
•   Web Object Prediction
•   Learning to Prefetch Web Documents
Data Cleaning on Web Log Data
• Break apart the long sequence of visits by users into user
  sessions.
• Identify each user by an individual IP address.
• Thus, data cleaning means separating the visiting
  sequence of pages into visiting sessions.
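The session-splitting step can be sketched as follows. The slides only say that users are identified by IP address; the 30-minute inactivity gap used here to decide where one session ends is an assumption, and sessionize and SESSION_GAP are illustrative names.

```python
from collections import defaultdict

# Assumption: a pause longer than 30 minutes starts a new session.
SESSION_GAP = 30 * 60  # seconds

def sessionize(hits, gap=SESSION_GAP):
    """Group (ip, unix_time, url) hits into per-IP sessions of URLs."""
    by_ip = defaultdict(list)
    for ip, timestamp, url in hits:
        by_ip[ip].append((timestamp, url))

    sessions = []
    for ip, visits in by_ip.items():
        visits.sort()  # order each user's hits by time
        current, last_seen = [], None
        for timestamp, url in visits:
            if last_seen is not None and timestamp - last_seen > gap:
                sessions.append((ip, current))
                current = []
            current.append(url)
            last_seen = timestamp
        if current:
            sessions.append((ip, current))
    return sessions

# Made-up hits for illustration only.
hits = [
    ("10.0.0.1", 0, "/"),
    ("10.0.0.2", 10, "/"),
    ("10.0.0.1", 60, "/news.html"),
    ("10.0.0.1", 4000, "/"),
    ("10.0.0.1", 4030, "/contacts.html"),
]
sessions = sessionize(hits)
for ip, pages in sessions:
    print(ip, pages)
```

An IP address is only an approximation of a user, since proxies and shared machines can merge several visitors into one address.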
Web Log Mining for Prefetching
• We now have separate visiting sessions.
• We can develop path profiles from these sessions, because a
  user visiting a sequence of Web pages often leaves a trail of
  the pages' URLs in a Web log.
• A path profile consists of frequent subsequences from the
  frequently occurring paths.
• A path profile helps us predict the next pages that are
  most likely to be visited.
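One simple way to build such a profile is to count contiguous page subsequences across sessions and keep the ones that occur often enough. This is a sketch of that idea, not necessarily the method used in the cited papers; max_len and min_count are assumed parameters.

```python
from collections import Counter

def path_profile(sessions, max_len=3, min_count=2):
    """Count contiguous subsequences of 2..max_len pages and keep frequent ones."""
    counts = Counter()
    for pages in sessions:
        for length in range(2, max_len + 1):
            for start in range(len(pages) - length + 1):
                counts[tuple(pages[start:start + length])] += 1
    return {path: n for path, n in counts.items() if n >= min_count}

# Made-up sessions for illustration only.
sessions = [
    ["/", "/news.html", "/contacts.html"],
    ["/", "/news.html", "/pics.html"],
    ["/news.html", "/contacts.html"],
]
profile = path_profile(sessions)
for path, n in profile.items():
    print(" -> ".join(path), n)
```

Here only the two-page paths "/ -> /news.html" and "/news.html -> /contacts.html" reach the threshold of two occurrences.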
Web Object Prediction
• It is possible to train a path-based model for predicting
  future URLs based on a sequence of current URL accesses.
• This can be done on a per-user basis or on a per-server
  basis.
• The former requires that user sessions be recognized
  and separated by a filtering system; the
  latter takes the simplistic view that the accesses on a server
  form a single long thread.
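A path-based predictor can be sketched as a table that maps the last one or two URLs to counts of the URL that followed them. This is an illustrative n-gram style model under assumed names (train, predict, order), not the specific model from the references. Passing separate sessions gives the per-user view; passing one concatenated sequence gives the per-server view.

```python
from collections import Counter, defaultdict

def train(sequences, order=2):
    """Map each context of up to `order` preceding URLs to next-URL counts."""
    model = defaultdict(Counter)
    for seq in sequences:
        for i in range(1, len(seq)):
            for k in range(1, order + 1):
                if i - k < 0:
                    break
                model[tuple(seq[i - k:i])][seq[i]] += 1
    return model

def predict(model, recent, order=2):
    """Return the most frequent next URL for the longest known context, or None."""
    for k in range(min(order, len(recent)), 0, -1):
        context = tuple(recent[-k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None

# Made-up sessions for illustration only.
per_user = [
    ["/", "/news.html", "/contacts.html"],
    ["/", "/news.html", "/contacts.html"],
    ["/pics.html", "/news.html", "/pics.html"],
]
user_model = train(per_user)

# Per-server view: all accesses treated as one long thread.
server_model = train([[url for session in per_user for url in session]])

print(predict(user_model, ["/", "/news.html"]))
print(predict(user_model, ["/pics.html", "/news.html"]))
```

The per-server model also learns transitions that cross session boundaries, which is the simplification the slide describes.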
Learning to Prefetch Web Documents
• The original cache memory is partitioned into two parts: a
  cache buffer and a prefetching buffer.
• A prefetching agent (a script) keeps pre-loading the
  prefetching buffer with documents predicted to be accessed
  next.
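The two-part cache can be sketched as below. This is a simplified illustration: both buffers are evicted in least-recently-used order and prefetching happens synchronously on each request, both of which are assumptions. TwoPartCache, fetch and predict are hypothetical names, and the next_page table stands in for a trained predictor.

```python
from collections import OrderedDict

class TwoPartCache:
    """Cache split into a cache buffer and a prefetching buffer."""

    def __init__(self, cache_size, prefetch_size, fetch, predict):
        self.cache = OrderedDict()      # documents users actually requested
        self.prefetch = OrderedDict()   # documents loaded ahead of time
        self.cache_size = cache_size
        self.prefetch_size = prefetch_size
        self.fetch = fetch              # callable: url -> document
        self.predict = predict          # callable: url -> likely next url or None

    @staticmethod
    def _put(buffer, limit, url, doc):
        buffer[url] = doc
        buffer.move_to_end(url)
        while len(buffer) > limit:
            buffer.popitem(last=False)  # evict the least recently used entry

    def get(self, url):
        if url in self.cache:
            self.cache.move_to_end(url)
            doc, source = self.cache[url], "cache"
        elif url in self.prefetch:
            # A prefetched document that gets used moves to the cache buffer.
            doc, source = self.prefetch.pop(url), "prefetch"
            self._put(self.cache, self.cache_size, url, doc)
        else:
            doc, source = self.fetch(url), "origin"
            self._put(self.cache, self.cache_size, url, doc)
        # Prefetching agent: load the predicted next document if not held yet.
        nxt = self.predict(url)
        if nxt is not None and nxt not in self.cache and nxt not in self.prefetch:
            self._put(self.prefetch, self.prefetch_size, nxt, self.fetch(nxt))
        return doc, source

fetched = []

def fetch(url):
    fetched.append(url)
    return "body of " + url

# Stand-in predictor for illustration only.
next_page = {"/": "/news.html", "/news.html": "/contacts.html"}
cache = TwoPartCache(2, 1, fetch, next_page.get)

first = cache.get("/")
second = cache.get("/news.html")
third = cache.get("/")
print(first[1], second[1], third[1])
```

In this run the second request is served from the prefetching buffer, so the user never waits on the origin fetch for that page.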
Web Page Clustering for Intelligent
              User Interfaces
• Web logs can be used to build server-side customization
  and transformation that make a Web site more convenient for
  users to visit and to find what they are looking for.
• These rely on path prediction algorithms that guess where the
  user wants to go next in a browsing session, such as
  WebWatcher and the PageGather algorithm.
Applications
• Search engines
• Similarity measures
• Ontology
• Information aggregation
• Recognition technology
• Summarization
• E-commerce
• Content management
Advantages
• It is easy to implement.
• Companies can establish better customer relationships
  by giving customers exactly what they need.
• Personalized search engines can be created that
  understand a person’s search queries in a personal way by
  analyzing and profiling the user’s search behavior.
• Caching and prefetching of Web objects can be improved.
• The mined knowledge can be used to build better, adaptive
  user interfaces.
• Web query log knowledge can be applied to improve Web
  search in a search engine application.
Reference
• Web logs from www.hdwally.com and
  www.hdwallpaper4u.com.
• www.jafsoft.com/searchengines/log_sample.html
• Research paper on Knowledge Discovery From Weblogs by
  S Chandra and Dr B Kalpana.
• Research paper on Mining Web Logs for Actionable
  Knowledge by Qiang Yang, Charles X. Ling and Jianfeng Gao.
• http://www.galeas.de/webmining.html
Queries ?
Thanks
