2. The World Wide Web
• The WWW is a huge, widely distributed, global information service centre for:
  • Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
  • Hyper-link information
  • Access and usage information
• The WWW provides rich sources of data for data mining
3. The Web: Opportunities & Challenges
1. The amount of information on the Web is huge
2. The coverage of Web information is very wide and diverse
3. Information/data of almost all types exist on the Web
4. Much of the Web information is semi-structured
5. Much of the Web information is linked
6. Much of the Web information is redundant
4. The Web: Opportunities & Challenges
7. The Web is noisy
8. The Web is also about services
9. The Web is dynamic
10. Above all, the Web is a virtual society
11. The Web consists of the surface Web and the deep Web
  • Surface Web: pages that can be browsed using a browser
  • Deep Web: databases that can only be accessed through parameterized query interfaces
5. What is Web Data?
• Web data includes:
  1. Web content: text, images, records, etc.
  2. Web structure: hyperlinks, tags, etc.
  3. Web usage: HTTP logs, app server logs, etc.
  4. Intra-page structures
  5. Inter-page structures
  6. Supplemental data
    1. Profiles
    2. Registration information
    3. Cookies
7. Web Mining
• Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services
• Web mining is the application of data mining techniques to find interesting and potentially useful knowledge from web data
• Web mining is the application of data mining techniques to extract knowledge from web data, including web documents, hyperlinks between documents, usage logs of web sites, etc.
8. Web Mining
• Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services
• Discovering useful information from the World-Wide Web and its usage patterns
• My definition: using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
9. Why Mine the Web?
• Enormous wealth of information on the Web
  • Financial information (e.g. stock quotes)
  • Book/CD/Video stores (e.g. Amazon)
  • Restaurant information
  • Car prices
• Lots of data on user access patterns
  • Web logs contain sequences of URLs accessed by users
• Possible to mine interesting nuggets of information
  • People who ski also travel frequently to Europe
  • Tech stocks have corrections in the summer and rally from November until February
10. Why is Web Mining Different?
• The Web is a huge collection of documents, plus:
  • Hyper-link information
  • Access and usage information
• The Web is very dynamic
  • New pages are constantly being generated
• Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to
  • Exploit hyper-links and access patterns
  • Be incremental
11. Web Mining: Subtasks
• Resource finding
  • Retrieving intended documents
• Information selection/pre-processing
  • Select and pre-process specific information from the retrieved documents
• Generalization
  • Discover general patterns within and across web sites
• Analysis
  • Validation and/or interpretation of mined patterns
12. Web Mining Issues
• Size
  • Grows at about 1 million pages a day
  • Google indexes 9 billion documents
  • Number of web sites: a Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html)
• Diverse types of data
  • Images
  • Text
  • Audio/video
  • XML
  • HTML
13. Web Mining Applications
• E-commerce (infrastructure)
  • Generate user profiles
  • Targeted advertising
  • Fraud detection
  • Similar image retrieval
• Information retrieval (search) on the Web
  • Automated generation of topic hierarchies
  • Web knowledge bases
  • Extraction of schema for XML documents
• Network management
  • Performance management
  • Fault management
15. Web Data Mining
• Use of data mining techniques to automatically discover interesting and potentially useful information from Web documents and services
• Web mining may be divided into three categories:
  1. Web content mining
  2. Web structure mining
  3. Web usage mining
17. Web Content Mining
• Discovery of useful information from web contents / data / documents
• Web data contents:
  1. text
  2. image
  3. audio
  4. video
  5. metadata
  6. hyperlinks
18. Web Content Mining
• Examines the contents of web pages as well as the results of web searching
• Can be thought of as extending the work performed by basic search engines
• Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users
• Web content mining is the process of extracting knowledge from web contents
19. Web Content Mining
• Basic search provides no information about the structure of the content we are searching for and no information about the various categories of documents that are found
• We therefore need more sophisticated tools for searching or discovering Web content
20. Web Content Mining
• Discovering useful information from the contents of Web pages
• Web content is very rich, consisting of text, image, audio, video, etc., plus metadata and hyperlinks
• The data may be unstructured (free text), structured (data from a database), or semi-structured (HTML), although much of the Web is unstructured
21. Web Content Data Structure
• Unstructured: free text
• Semi-structured: HTML
• More structured: tables or database-generated HTML pages
• Multimedia data: receives less attention than text or hypertext
22. Web Content Mining
• Web content mining is related to data mining and text mining
• It is related to data mining because many data mining techniques can be applied in Web content mining
• It is related to text mining because much of the web content is text
• Web data are mainly semi-structured and/or unstructured, whereas data mining deals primarily with structured data and text mining with unstructured text
23. Web Content Data Structure
• Web content consists of several types of data
  • Text, image, audio, video, hyperlinks
• Unstructured: free text
• Semi-structured: HTML
• More structured: data in tables or database-generated HTML pages
• Note: much of the Web content data is unstructured text data
24. Semi-structured Data
• Content is, in general, semi-structured
• Example: a document record with fields
  • Title
  • Author
  • Publication_Date
  • Length
  • Category
  • Abstract
  • Content
25. Web Content Mining: IR View
• Unstructured documents
  • Bag-of-words or phrase-based feature representation
  • Features can be boolean or frequency-based
  • Features can be reduced using different feature selection techniques
  • Word stemming: combining morphological variations into one feature
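The boolean and frequency-based feature representations above can be sketched in a few lines; the tokenizer and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-letters (a deliberately minimal tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def frequency_features(doc):
    """Bag-of-words with term-frequency values."""
    return Counter(tokenize(doc))

def boolean_features(doc):
    """Bag-of-words with boolean (presence/absence) values."""
    return {term: True for term in tokenize(doc)}

doc = "Web mining applies data mining techniques to Web data"
print(frequency_features(doc)["mining"])   # 2: "mining" occurs twice
print(boolean_features(doc)["web"])        # True: presence only
```

Both representations discard word order, which is exactly why they are called "bag" of words.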
26. Web Content Mining: IR View
• Semi-structured documents
  • Use richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
  • Use common data mining methods (whereas unstructured documents might require more text mining methods)
27. Web Content Mining: DB View
• Tries to infer the structure of a Web site or transform a Web site to become a database
  • Better information management
  • Better querying on the Web
• Can be achieved by:
  • Finding the schema of Web documents
  • Building a Web warehouse
  • Building a Web knowledge base
  • Building a virtual database
28. Web Content Mining: DB View
• Mainly uses the Object Exchange Model (OEM)
  • Represents semi-structured data (some structure, no rigid schema) by a labeled graph
• The process typically starts with manual selection of Web sites for content mining
• Main application: building a structural summary of semi-structured data (schema extraction or discovery)
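A minimal sketch of an OEM-style labeled graph, assuming a simple tuple encoding: each object is either atomic (a value) or complex (a list of labeled edges to child objects). The book record and its field names are invented for illustration.

```python
# OEM-style object: ("atomic", value) or ("complex", [(label, child), ...]).
# There is no rigid schema; structure lives in the edge labels themselves.
book = ("complex", [
    ("title",  ("atomic", "Web Data Mining")),
    ("author", ("complex", [
        ("name", ("atomic", "B. Liu")),
    ])),
    ("price",  ("atomic", 59.95)),
])

def labels(obj):
    """Outgoing edge labels of an object: the 'schema' visible at this node."""
    kind, payload = obj
    return [lab for lab, _ in payload] if kind == "complex" else []

print(labels(book))  # ['title', 'author', 'price']
```

Schema extraction over many such objects amounts to summarizing which labels occur where, which is why the labeled-graph view fits semi-structured data well.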
29. Techniques for Web Content Mining
• Classification
• Clustering
• Association
30. Web Content Mining: Topics
• Structured data extraction
• Unstructured text extraction
• Sentiment classification, analysis and summarization of consumer reviews
• Information integration and schema matching
• Knowledge synthesis
• Template detection and page segmentation
31. Structured Data Extraction
• The most widely studied research topic
• A large amount of information on the Web is contained in regularly structured data objects (retrieved from databases)
• Such Web data records are important because they often present the essential information of their host pages, e.g., lists of products and services
32. Structured Data Extraction
• Applications: integrated and value-added services, e.g., comparative shopping, meta-search & query, etc.
34. Structured Data Extraction: Approaches
• Wrapper generation
  • Write an extraction program for each website based on observed format patterns
  • Labor intensive & time consuming
38. • Automatic approach
  • Structured data objects on the web are normally database records
  • Retrieved from databases & displayed in web pages with fixed templates
  • Find patterns / grammars from the web pages & then use them to extract data
  • e.g. IEPAD, MDR, ROADRUNNER, EXALG, etc.
39. • Wrapper induction (wrapper learning)
  • Currently the main technique
  • The user first manually labels a set of training pages
  • A learning system then generates rules from the training pages
  • The resulting rules are then applied to extract target items from web pages
  • e.g. WIEN, Stalker, BWI, WL, etc.
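The label-learn-apply loop above can be illustrated with a toy delimiter-based rule: learn the strings immediately left and right of a labeled value, then extract whatever sits between them on a new page. This is a drastic simplification, not the actual WIEN or Stalker algorithm, and the page snippets are made up.

```python
def learn_rule(page, target):
    """Learn left/right delimiter strings around one labeled target value.
    Real wrapper-induction systems generalize over many labeled pages."""
    i = page.index(target)
    left = page[max(0, i - 4):i]                     # 4-char landmark, arbitrary
    right = page[i + len(target):i + len(target) + 5]
    return left, right

def apply_rule(page, rule):
    """Extract the text between the learned delimiters on a new page."""
    left, right = rule
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]

train = "<b>Price:</b> $19.99<br>"
rule = learn_rule(train, "$19.99")
print(apply_rule("<b>Price:</b> $4.50<br>", rule))  # $4.50
```

The rule transfers to new pages only because the pages share a fixed template, which is precisely the assumption wrapper approaches rely on.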
40. • Supervised learning
  • Supervised learning is a machine learning technique for creating a function from training data
  • Documents are categorized (labeled)
  • The output can predict a class label for an input object (called classification)
  • Techniques used:
    • Nearest neighbor classifier
    • Feature selection
    • Decision trees
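A minimal nearest-neighbor text classifier, as a sketch of the first technique listed: assign a new document the label of the most similar training document. The bag-of-words representation, cosine similarity, and the two training documents are illustrative choices, not prescribed by the slides.

```python
from collections import Counter

def bow(text):
    """Term-frequency bag of words over whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def nearest_neighbor_classify(doc, training):
    """Return the label of the most similar (doc, label) training pair."""
    return max(training, key=lambda pair: cosine(bow(doc), bow(pair[0])))[1]

training = [
    ("stock market quotes finance", "finance"),
    ("football match score goal", "sports"),
]
print(nearest_neighbor_classify("latest finance quotes", training))  # finance
```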
41. • Removes terms in the training documents which are statistically uncorrelated with the class labels
• Simple heuristics:
  • Remove stop words like "a", "an", "the", etc.
  • Discard "too frequent" and "too rare" terms, using empirically chosen thresholds
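These heuristics can be sketched as follows; the stop-word list and the document-frequency thresholds are arbitrary illustrative values, not empirically tuned ones.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "of", "that", "and", "to"}

def select_features(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that are not stop words, appear in at least min_df
    documents ("too rare" filter), and in at most max_df_ratio of all
    documents ("too frequent" filter)."""
    df = Counter()
    for doc in docs:
        for term in set(doc.lower().split()):
            df[term] += 1                      # document frequency, not term count
    n = len(docs)
    return {t for t, c in df.items()
            if t not in STOP_WORDS and min_df <= c <= max_df_ratio * n}

docs = [
    "the web is huge",
    "web mining is useful",
    "web mining grows daily",
]
print(sorted(select_features(docs)))  # ['mining']
```

Here "web" is dropped as too frequent (it appears in every document), the singletons as too rare, and the stop words outright, leaving only "mining".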
42. Examples of Discovered Patterns
• Association rules
  • 98% of AOL users also have E-trade accounts
• Classification
  • People with age less than 40 and salary > 40K trade on-line
• Clustering
  • Users A and B access similar URLs
• Outlier detection
  • User A spends more than twice the average amount of time surfing on the Web
43. • Important for improving customization
  • Provide users with pages and advertisements of interest
  • Example profiles: on-line trader, on-line shopper
• Generate user profiles based on their access patterns
  • Cluster users based on frequently accessed URLs
  • Use a classifier to generate a profile for each cluster
• Engage Technologies
  • Tracks web traffic to create anonymous user profiles of Web surfers
  • Has profiles for more than 35 million anonymous users
44. • Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites
• Plenty of startups doing internet advertising
  • Doubleclick, AdForce, Flycast, AdKnowledge
• Internet advertising is probably the "hottest" web mining application today
45. • Scheme 1:
  • Manually associate a set of ads with each user profile
  • For each user, display an ad from the set based on the profile
• Scheme 2:
  • Automate the association between ads and users
  • Use ad click information to cluster users (each user is associated with the set of ads that he/she clicked on)
  • For each cluster, find the ads that occur most frequently in the cluster; these become the ads for the set of users in the cluster
46. Internet Advertising
• Use collaborative filtering (e.g. Likeminds, Firefly)
• Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.)
• Rij = rating of user Ui for ad Aj
• Problem: compute user Ui's rating for an unrated ad Aj
• [Figure: a partially filled user-ad ratings matrix over ads A1, A2, A3, with "?" marking the unrated entry]
47. Internet Advertising
• Key idea: user Ui's rating for ad Aj is set to Rkj, where Uk is the user whose ratings of ads are most similar to Ui's
• User Ui's rating for an ad Aj that has not been previously displayed to Ui is computed as follows:
  • Consider each user Uk who has rated ad Aj
  • Compute Dik, the distance between Ui's and Uk's ratings on common ads
  • Ui's rating for ad Aj = Rkj, where Uk is the user with the smallest Dik
• Display to Ui the ad Aj with the highest computed rating
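The rating-prediction procedure on this slide can be sketched directly: among users who rated Aj, pick the one closest to Ui on commonly rated ads and copy their rating. The Euclidean distance and the sample ratings are illustrative assumptions.

```python
def predict_rating(ratings, ui, aj):
    """Predict Ui's rating for ad Aj as Rkj of the user Uk with the
    smallest distance Dik over commonly rated ads."""
    best_user, best_dist = None, float("inf")
    for uk, r in ratings.items():
        if uk == ui or aj not in r:
            continue                      # only consider users who rated Aj
        common = (set(r) & set(ratings[ui])) - {aj}
        if not common:
            continue
        dik = sum((ratings[ui][a] - r[a]) ** 2 for a in common) ** 0.5
        if dik < best_dist:
            best_user, best_dist = uk, dik
    return ratings[best_user][aj] if best_user else None

ratings = {
    "U1": {"A1": 5, "A2": 1},           # U1 has not seen A3
    "U2": {"A1": 5, "A2": 2, "A3": 4},  # closest to U1 on A1, A2
    "U3": {"A1": 1, "A2": 5, "A3": 1},
}
print(predict_rating(ratings, "U1", "A3"))  # 4, copied from U2
```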
48. • With the growing popularity of E-commerce, systems to detect and prevent fraud on the Web become important
• Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought)
• If the buying pattern changes significantly, signal fraud
• HNC Software uses domain knowledge and neural networks for credit card fraud detection
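A toy version of the signature idea, assuming the signature is just the mean and standard deviation of past spending; the deviation threshold is arbitrary, and this is not HNC's actual method.

```python
def fraud_alert(signature, amount, threshold=3.0):
    """Flag a transaction whose amount deviates from the user's historical
    mean by more than `threshold` standard deviations."""
    mean, std = signature
    return abs(amount - mean) > threshold * std

signature = (50.0, 10.0)              # historical mean and std of amount spent
print(fraud_alert(signature, 55.0))   # False: within the normal range
print(fraud_alert(signature, 500.0))  # True: significant change, signal fraud
```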
49. • Given:
  • A set of images
• Find:
  • All images similar to a given image
  • All pairs of similar images
• Sample applications:
  • Medical diagnosis
  • Weather prediction
  • Web search engine for images
  • E-commerce
50. • QBIC, Virage, Photobook
  • Compute a feature signature for each image
    • QBIC uses color histograms
    • WBIIS, WALRUS use wavelets
  • Use a spatial index to retrieve the database image whose signature is closest to the query's signature
• WALRUS decomposes an image into regions
  • A single signature is stored for each region
  • Two images are considered similar if they have enough similar region pairs
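A simplified stand-in for QBIC-style color-histogram signatures: quantize pixel intensities into bins, normalize, and compare signatures by distance. The bin count, the grayscale pixel values, and the L1 distance are illustrative choices.

```python
def color_histogram(pixels, bins=4):
    """Quantize 0-255 intensity values into a normalized histogram signature."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [h / total for h in hist]

def histogram_distance(h1, h2):
    """L1 distance between two signatures; smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

dark  = color_histogram([10, 20, 30, 40])
dark2 = color_histogram([15, 25, 35, 45])
light = color_histogram([200, 210, 220, 230])
# The two dark images are closer to each other than to the light one.
print(histogram_distance(dark, dark2) < histogram_distance(dark, light))  # True
```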
52. • Today's search engines are plagued by problems:
  • The abundance problem (99% of the information is of no interest to 99% of the people)
  • Limited coverage of the Web (many internet sources are hidden behind search interfaces)
    • The largest crawlers cover < 18% of all web pages
  • Limited query interfaces, based on keyword-oriented search
  • Limited customization to individual users
53. • Today's search engines are plagued by problems:
  • The Web is highly dynamic
    • Lots of pages are added, removed, and updated every day
  • Very high dimensionality
54. • Use Web directories (or topic hierarchies)
  • Provide a hierarchical classification of documents (e.g., Yahoo!)
  • Searches performed in the context of a topic restrict the search to only a subset of web pages related to the topic
• [Figure: a topic hierarchy rooted at the Yahoo home page, with children such as Business, News, Recreation, and Science, and grandchildren such as Companies, Finance, Jobs, Travel, and Sports]
55. • In the Clever project, hyper-links between Web pages are taken into account when categorizing them
  • Uses a Bayesian classifier
  • Exploits knowledge of the classes of the immediate neighbors of the document to be classified
  • Shows that simply taking text from neighbors and using standard document classifiers to classify the page does not work
• Inktomi's Directory Engine uses "Concept Induction" to automatically categorize millions of documents
56. • Objective: to deliver content to users quickly and reliably
  • Traffic management
  • Fault management
• [Figure: a service provider network connecting routers and servers]
57. • While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three
• The result is frequent congestion at servers and on network links
  • During a major event (e.g., Princess Diana's death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
  • Olympic sites during the games
  • NASA sites close to launches and landings of shuttles
58. • Key ideas
  • Dynamically replicate/cache content at multiple sites within the network, closer to the user
  • Multiple paths between any pair of sites
  • Route user requests to the server closest to the user or to the least loaded server
  • Use the path with the least congested network links
  • Akamai, Inktomi
60. • Need to mine network and Web traffic to determine
  • What content to replicate?
  • Which servers should store replicas?
  • Which server should a user request be routed to?
  • What path should be used to route packets?
• Network design issues
  • Where to place servers?
  • Where to place routers?
  • Which routers should be connected by links?
• One can use association rules and sequential pattern mining algorithms to cache/prefetch replicas at a server
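A simple sketch of mining page-transition patterns from session logs to pick prefetch candidates: count which page most often follows each page. This is a stand-in for full sequential pattern mining, and the URLs are made up.

```python
from collections import Counter

def next_page_counts(sessions):
    """Count (page, next_page) transitions across all user sessions."""
    pairs = Counter()
    for session in sessions:
        for a, b in zip(session, session[1:]):
            pairs[(a, b)] += 1
    return pairs

def prefetch_candidate(pairs, page):
    """The most frequent successor of `page` is the prefetch candidate."""
    followers = {b: c for (a, b), c in pairs.items() if a == page}
    return max(followers, key=followers.get) if followers else None

sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/specs"],
    ["/home", "/news"],
]
pairs = next_page_counts(sessions)
print(prefetch_candidate(pairs, "/home"))  # /products, seen in 2 of 3 sessions
```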
61. • Fault management involves
  • Quickly identifying failed/congested servers and links in the network
  • Re-routing user requests and packets to avoid congested/down servers and links
• Need to analyze alarm and traffic data to carry out root cause analysis of faults
• Bayesian classifiers can be used to predict the root cause given a set of alarms
63. • Web data sets can be very large
  • Tens to hundreds of terabytes
• Cannot mine on a single server!
  • Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
  • Without breaking the bank!
65. • Pages contain information
• Links are "roads"
• How do people navigate the Internet? → Web usage mining (clickstream analysis)
• Information on navigation paths is available in log files
• Logs can be mined from a client or a server perspective
66. • Why analyze Website usage?
• Knowledge about how visitors use a Website could
  • Provide guidelines for web site reorganization; help prevent disorientation
  • Help designers place important information where visitors look for it
  • Enable pre-fetching and caching of web pages
  • Provide an adaptive Website (personalization)
• Questions which could be answered
  • What are the differences in usage and access patterns among users?
  • Which user behaviors change over time?
  • How do usage patterns change with quality of service (slow/fast)?
  • What is the distribution of network traffic over time?
69. • Analog: a Web log file analyser
• Gives basic statistics such as
  • number of hits
  • average hits per time period
  • which pages in your site are popular
  • who is visiting your site
  • which keywords users are searching for to reach you
  • what is being downloaded
• http://www.analog.cx/
73. • Content is, in general, semi-structured
• Example: a document record with fields
  • Title
  • Author
  • Publication_Date
  • Length
  • Category
  • Abstract
  • Content
74. • Many methods are designed to analyze structured data
• If we can represent documents by a set of attributes, we will be able to use existing data mining methods
• How to represent a document?
  • Vector-based representation (referred to as "bag of words" as it is invariant to permutations)
  • Use statistics to add a numerical dimension to unstructured text
75. • A document representation aims to capture what the document is about
• One possible approach:
  • Each entry describes a document
  • Attributes describe whether or not a term appears in the document
76. • Another approach:
  • Each entry describes a document
  • Attributes represent the frequency with which a term appears in the document
77. • Stop word removal: many words are not informative and thus irrelevant for document representation: the, and, a, an, is, of, that, …
• Stemming: reducing words to their root form (reduces dimensionality)
  • A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but would not be retrieved by a query with the keyword "fishing"
  • Different words that share the same word stem should be represented by that stem ("fish") instead of the actual word
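The two preprocessing steps above can be sketched with a toy suffix-stripper; a real system would use, e.g., the Porter stemmer, and both the stop-word list and the suffix rules here are illustrative.

```python
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}

def crude_stem(word):
    """A toy suffix-stripping stemmer: maps fishes/fisher/fishers/fishing
    toward the common stem 'fish'. Not the Porter algorithm."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Stop-word removal followed by stemming."""
    return [crude_stem(w) for w in text.lower().split()
            if w not in STOP_WORDS]

print(preprocess("the fisher is fishing and catches fishes"))
# ['fish', 'fish', 'catch', 'fish']
```

After preprocessing, all the morphological variants of "fish" collapse onto one feature, so a query for "fishing" now matches documents that only mention "fishes" or "fisher".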