2. Web mining is the use of data mining
techniques to automatically discover and extract
information from web documents and services
Discovering useful information from the World-Wide
Web and its usage patterns
Using data mining techniques to make the web
more useful and more profitable (for some) and to
increase the efficiency of our interaction with the
web
3. Web usage mining is the process of extracting useful
information from server logs, i.e. finding out what users are
looking for on the Internet. Some users might be
looking at only textual data, whereas others
might be interested in multimedia data. Web usage
mining is the application of data mining techniques to
discover interesting usage patterns from Web data in
order to understand and better serve the needs of
Web-based applications.
4. Web Mining
Web content mining
  Web page content mining
  Search result mining
Web structure mining
Web usage mining
  General access pattern tracking
  Customized usage tracking
5. Data Mining Techniques
Association rules
Sequential patterns
Classification
Clustering
Outlier discovery
Applications to the Web
E-commerce
Information retrieval (search)
Network management
6. The WWW is a huge, widely distributed, global
information service centre for
Information services: news, advertisements, consumer
information, financial management, education,
government, e-commerce, etc.
Hyper-link information
Access and usage information
WWW provides rich sources of data for data mining
7. Enormous wealth of information on the Web
Financial information (e.g. stock quotes)
Book/CD/Video stores (e.g. Amazon)
Restaurant information (e.g. Zagat's)
Car prices (e.g. CarPoint)
Lots of data on user access patterns
Web logs contain sequences of URLs accessed by users
Possible to mine interesting nuggets of information
People who ski also travel frequently to Europe
Tech stocks have corrections in the summer and rally from
November until February
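A nugget such as "people who ski also travel frequently to Europe" is an association rule. As a minimal sketch of how its support and confidence could be measured, assuming each user session is simply a list of visited URLs (the sessions and paths below are made up for illustration, and `rule_stats` is a hypothetical helper, not part of any library):

```python
def rule_stats(sessions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(sessions)
    # Sessions that contain every item of the antecedent.
    n_antecedent = sum(1 for s in sessions if antecedent <= set(s))
    # Sessions that contain both the antecedent and the consequent.
    n_both = sum(1 for s in sessions if (antecedent | consequent) <= set(s))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Illustrative sessions only; not real usage data.
sessions = [
    ["/ski", "/europe-flights"],
    ["/ski", "/europe-flights", "/hotels"],
    ["/ski", "/gear"],
    ["/stocks"],
]
support, confidence = rule_stats(sessions, ["/ski"], ["/europe-flights"])
```

Here support is the fraction of all sessions containing both pages, and confidence is the fraction of skiing sessions that also contain the Europe page.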
8. The Web is a huge collection of documents, plus
Hyper-link information
Access and usage information
The Web is very dynamic
New pages are constantly being generated
Challenge: Develop new Web mining algorithms and
adapt traditional data mining algorithms to
Exploit hyper-links and access patterns
Be incremental
9. Given:
A source of textual documents
A well defined limited query (text based)
Find:
Sentences with relevant information
Extract the relevant information and
ignore non-relevant information (important!)
Link related information and output in a
predetermined format
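The Given/Find steps above can be sketched roughly as follows. This is only an illustration under simple assumptions: the "query" is a single keyword, the "relevant information" is a dollar amount, and the predetermined output format is a list of dictionaries. The function name, regular expressions, and sample documents are all hypothetical.

```python
import re

# Assumed pattern for the information to extract: a dollar amount.
PRICE = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_prices(documents, query):
    """Find sentences mentioning `query` and extract prices from them."""
    records = []
    for doc_id, text in enumerate(documents):
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if query.lower() not in sentence.lower():
                continue  # ignore non-relevant sentences
            for amount in PRICE.findall(sentence):
                # Predetermined output format: one flat record per match.
                records.append({
                    "doc": doc_id,
                    "query": query,
                    "price": float(amount),
                    "sentence": sentence.strip(),
                })
    return records


# Made-up documents for illustration.
docs = [
    "The sedan is listed at $18500. Delivery takes two weeks.",
    "Our cafe opens at nine. The sedan costs $17999.50 at the other dealer.",
]
records = extract_prices(docs, "sedan")
```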
10. Keyword (or term) based association analysis
automatic document (topic) classification
similarity detection
cluster documents by a common author
cluster documents containing information from a
common source
sequence analysis: predicting a recurring
event, discovering trends
anomaly detection: find information that
violates usual patterns
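Of the tasks listed above, similarity detection is the easiest to illustrate. One common keyword-based approach, assumed here purely as an example, is cosine similarity over raw term counts; the tokenizer and function names are hypothetical choices.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase alphabetic terms in a document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the two documents' term-count vectors."""
    ca, cb = term_counts(doc_a), term_counts(doc_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


a = "web mining discovers patterns in web data"
b = "web usage mining finds patterns in web logs"
c = "ski resorts in europe"
sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
```

Documents that share many terms score close to 1, and documents with no terms in common score 0, so the scores can feed a clustering or near-duplicate detection step.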
12. Creating a model of web organization
Classify web pages
Create similarity measures between web
pages
PageRank
The Clever system
Hyperlink-Induced Topic Search (HITS)
13. Combine the intelligent IR tools
meaning of words
order of words in the query
user dependency for the data
authority of the source
With the unique web features
retrieve Hyper-link information
utilize Hyper-link as input
14. Program which browses WWW in a methodical,
automated manner
Copies pages into a cache and indexes them
Starts from a seed URL
Searches for and follows links, collects keywords
Types of Crawler
Context focused
Focused
Incremental
Periodic
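The basic crawl loop described above (start from a seed URL, cache each page, index its keywords, follow its links) might look like the sketch below. To keep it runnable without network access it assumes a hypothetical `fetch` callable that returns a page's HTML or `None`; here a small in-memory dictionary of made-up pages stands in for the Web.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect href targets and visible words from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; returns (cache, keyword index)."""
    cache, index = {}, {}
    queue, seen = deque([seed]), {seed}
    while queue and len(cache) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved
        cache[url] = html  # keep a copy in the cache
        parser = LinkParser()
        parser.feed(html)
        parser.close()
        for word in parser.words:  # index keywords -> pages
            index.setdefault(word, set()).add(url)
        for link in parser.links:  # schedule unseen links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return cache, index


# Made-up pages; a real crawler would fetch over HTTP instead.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">mining</a> web',
    "http://b.example/": '<a href="http://a.example/">home</a> '
                         '<a href="http://c.example/">logs</a>',
}
cache, index = crawl("http://a.example/", FAKE_WEB.get)
```

A focused or incremental crawler would differ mainly in how the queue is ordered and when pages are revisited; this sketch does neither.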
15. Link analysis algorithm which assigns a
numerical weight to each web page.
The numerical weight that it assigns to any
given element E is also called the PageRank of
E and denoted by PR(E).
The PageRank value for a page u is the sum,
over each page v in the set B_u (this set
contains all pages linking to page u), of the
PageRank value of v divided by the number L(v)
of links from page v:
PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)
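The rule above can be turned into a simple iterative computation: each page repeatedly passes its current rank, divided by its number of out-links, to the pages it links to. The sketch below adds two things the slide does not mention and that are assumptions here: a damping factor (0.85 is a commonly used value) and an even redistribution of the rank of pages with no out-links. The graph is a made-up three-page example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank for a dict {page: [linked pages]}."""
    pages = set(links) | {u for targets in links.values() for u in targets}
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Assumption: rank held by pages without out-links is spread evenly.
        dangling = sum(pr[p] for p in pages if not links.get(p))
        new_pr = {p: (1.0 - damping) / n + damping * dangling / n
                  for p in pages}
        for v, targets in links.items():
            for u in targets:
                # Page v gives PR(v) / L(v) to every page u it links to.
                new_pr[u] += damping * pr[v] / len(targets)
        pr = new_pr
    return pr


# Made-up link graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this small graph C collects rank from both A and B, so it ends up ranked highest, followed by A and then B.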
17. Finds both authoritative pages and
hubs
Authority - best source on a topic
Hub - links to authoritative pages
Most valuable pages returned
Hyperlink Induced Topic Search
Keywords
Authority and hub measure
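The mutual reinforcement between hubs and authorities can be sketched as an iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The version below is a minimal illustration on a made-up graph; it normalizes scores by their Euclidean length each round and skips the keyword-based selection of the initial page set.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) scores for a dict {page: [linked pages]}."""
    pages = set(links) | {u for targets in links.values() for u in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum hub scores of pages that link to p.
        auth = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                auth[u] += hub[v]
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {p: a / norm for p, a in auth.items()}
        # Hub update: sum authority scores of pages that p links to.
        hub = {p: sum(auth[u] for u in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {p: h / norm for p, h in hub.items()}
    return hub, auth


# Made-up graph: "portal" and "blog" point at content pages.
links = {"portal": ["docs", "paper"], "blog": ["docs"],
         "docs": [], "paper": []}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```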
18. Applies mining on web usage data or weblogs
or clickstream data
Client perspective
Server perspective
Aid in personalization
Helps in evaluating quality and effectiveness
Preprocessing, pattern discovery and data
structures
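As a rough sketch of the preprocessing and pattern discovery steps from the server perspective, the code below parses access-log lines and derives two simple patterns: how often each page is requested and the sequence of pages each client visits. It assumes the logs are in Common Log Format, that only successful (2xx) requests are of interest, and that a client is identified by its host address alone; the sample lines are made up.

```python
import re
from collections import Counter, defaultdict

# Assumed input format: Common Log Format, one request per line.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_log(lines):
    """Preprocessing: yield (host, path) for each successful request."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("status").startswith("2"):
            yield match.group("host"), match.group("path")


def usage_patterns(lines):
    """Pattern discovery: page popularity and per-client click paths."""
    page_hits = Counter()
    paths_by_host = defaultdict(list)
    for host, path in parse_log(lines):
        page_hits[path] += 1
        paths_by_host[host].append(path)
    return page_hits, dict(paths_by_host)


# Made-up log lines for illustration.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [10/Oct/2023:13:57:12 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [10/Oct/2023:13:57:40 +0000] "GET /missing.html HTTP/1.1" 404 128',
]
page_hits, paths_by_host = usage_patterns(SAMPLE_LOG)
```

The per-client click paths are the kind of input that sequential pattern or association rule mining would then work on.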