2. The World Wide Web
• The WWW is a huge, widely distributed, global information service centre for:
  • Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
  • Hyper-link information
  • Access and usage information
• The WWW provides rich sources of data for data mining
3. The Web: Opportunities & Challenges
1. The amount of information on the Web is huge
2. The coverage of Web information is very wide and diverse
3. Information/data of almost all types exist on the Web
4. Much of the Web information is semi-structured
5. Much of the Web information is linked
6. Much of the Web information is redundant
4. The Web: Opportunities & Challenges
7. The Web is noisy
8. The Web is also about services
9. The Web is dynamic
10. Above all, the Web is a virtual society
11. The Web consists of the surface Web and the deep Web
  • Surface Web: pages that can be browsed using a browser
  • Deep Web: databases that can only be accessed through parameterized query interfaces
5. What is Web Data?
• Web data includes:
  1. Web content: text, images, records, etc.
  2. Web structure: hyperlinks, tags, etc.
  3. Web usage: HTTP logs, app server logs, etc.
  4. Intra-page structures
  5. Inter-page structures
  6. Supplemental data
    1. Profiles
    2. Registration information
    3. Cookies
7. Web Mining
• Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services
• Web mining is the application of data mining techniques to find interesting and potentially useful knowledge from web data
• Web mining is the application of data mining techniques to extract knowledge from web data, including web documents, hyperlinks between documents, usage logs of web sites, etc.
8. Web Mining
• Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services
• Discovering useful information from the World-Wide Web and its usage patterns
• My definition: using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
9. Why Mine the Web?
• Enormous wealth of information on the Web
  • Financial information (e.g. stock quotes)
  • Book/CD/Video stores (e.g. Amazon)
  • Restaurant information
  • Car prices
• Lots of data on user access patterns
  • Web logs contain sequences of URLs accessed by users
• Possible to mine interesting nuggets of information
  • People who ski also travel frequently to Europe
  • Tech stocks have corrections in the summer and rally from November until February
10. Why is Web Mining Different?
• The Web is a huge collection of documents, plus:
  • Hyper-link information
  • Access and usage information
• The Web is very dynamic
  • New pages are constantly being generated
• Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to
  • Exploit hyper-links and access patterns
  • Be incremental
11. Web Mining: Subtasks
• Resource finding
  • Retrieving intended documents
• Information selection/pre-processing
  • Select and pre-process specific information from the retrieved documents
• Generalization
  • Discover general patterns within and across web sites
• Analysis
  • Validation and/or interpretation of mined patterns
12. Web Mining Issues
• Size
  • Grows at about 1 million pages a day
  • Google indexes 9 billion documents
  • Number of web sites: a Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html)
• Diverse types of data
  • Images
  • Text
  • Audio/video
  • XML
  • HTML
13. Web Mining Applications
• E-commerce (infrastructure)
  • Generate user profiles
  • Targeted advertising
  • Fraud detection
  • Similar image retrieval
• Information retrieval (search) on the Web
  • Automated generation of topic hierarchies
  • Web knowledge bases
  • Extraction of schema for XML documents
• Network management
  • Performance management
  • Fault management
15. Web Data Mining
• Use of data mining techniques to automatically discover interesting and potentially useful information from Web documents and services
• Web mining may be divided into three categories:
  1. Web content mining
  2. Web structure mining
  3. Web usage mining
17. Web Content Mining
• Discovery of useful information from web contents / data / documents
• Web data contents:
  1. text
  2. image
  3. audio
  4. video
  5. metadata
  6. hyperlinks
18. Web Content Mining
• Examines the contents of web pages as well as the results of web searching
• Can be thought of as extending the work performed by basic search engines
• Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users
• Web content mining is the process of extracting knowledge from web contents
19. Web Content Mining
• Basic search provides no information about the structure of the content we are searching for and no information about the various categories of documents that are found
• We therefore need more sophisticated tools for searching or discovering Web content
20. Web Content Mining
• Discovering useful information from the contents of Web pages
• Web content is very rich, consisting of text, image, audio, video, etc., plus metadata and hyperlinks
• The data may be unstructured (free text), structured (data from a database), or semi-structured (HTML), although much of the Web is unstructured
21. Web Content Data Structure
• Unstructured: free text
• Semi-structured: HTML
• More structured: tables or database-generated HTML pages
• Multimedia data: receives less attention than text or hypertext
22. Web Content Mining
• Web content mining is related to data mining and text mining
• It is related to data mining because many data mining techniques can be applied in Web content mining
• It is related to text mining because much of the web content is text
• Web data are mainly semi-structured and/or unstructured, whereas data mining deals primarily with structured data and text mining with unstructured text
23. Web Content Data Structure
• Web content consists of several types of data
  • Text, image, audio, video, hyperlinks
• Unstructured: free text
• Semi-structured: HTML
• More structured: data in tables or database-generated HTML pages
• Note: much of the Web content data is unstructured text data
24. Semi-structured Data
• Content is, in general, semi-structured
• Example: a document record with fields
  • Title
  • Author
  • Publication_Date
  • Length
  • Category
  • Abstract
  • Content
25. Web Content Mining: IR View
• Unstructured documents
  • Bag-of-words or phrase-based feature representation
  • Features can be boolean or frequency-based
  • Features can be reduced using different feature selection techniques
  • Word stemming: combining morphological variations into one feature
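The boolean and frequency-based feature representations above can be sketched in a few lines; the tokenizer and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-letters (a deliberately minimal tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def frequency_features(doc):
    """Bag-of-words with term-frequency values."""
    return Counter(tokenize(doc))

def boolean_features(doc):
    """Bag-of-words with boolean (presence/absence) values."""
    return {term: True for term in tokenize(doc)}

doc = "Web mining applies data mining techniques to Web data"
print(frequency_features(doc)["mining"])   # 2: "mining" occurs twice
print(boolean_features(doc)["web"])        # True: presence only
```

Both representations discard word order, which is exactly why they are called "bag" of words.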
26. Web Content Mining: IR View
• Semi-structured documents
  • Use richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
  • Use common data mining methods (whereas unstructured documents might require more text mining methods)
27. Web Content Mining: DB View
• Tries to infer the structure of a Web site or transform a Web site to become a database
  • Better information management
  • Better querying on the Web
• Can be achieved by:
  • Finding the schema of Web documents
  • Building a Web warehouse
  • Building a Web knowledge base
  • Building a virtual database
28. Web Content Mining: DB View
• Mainly uses the Object Exchange Model (OEM)
  • Represents semi-structured data (some structure, no rigid schema) by a labeled graph
• The process typically starts with manual selection of Web sites for content mining
• Main application: building a structural summary of semi-structured data (schema extraction or discovery)
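A minimal sketch of an OEM-style labeled graph, assuming a simple tuple encoding: each object is either atomic (a value) or complex (a list of labeled edges to child objects). The book record and its field names are invented for illustration.

```python
# OEM-style object: ("atomic", value) or ("complex", [(label, child), ...]).
# There is no rigid schema; structure lives in the edge labels themselves.
book = ("complex", [
    ("title",  ("atomic", "Web Data Mining")),
    ("author", ("complex", [
        ("name", ("atomic", "B. Liu")),
    ])),
    ("price",  ("atomic", 59.95)),
])

def labels(obj):
    """Outgoing edge labels of an object: the 'schema' visible at this node."""
    kind, payload = obj
    return [lab for lab, _ in payload] if kind == "complex" else []

print(labels(book))  # ['title', 'author', 'price']
```

Schema extraction over many such objects amounts to summarizing which labels occur where, which is why the labeled-graph view fits semi-structured data well.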
29. Techniques for Web Content Mining
• Classification
• Clustering
• Association
30. Web Content Mining: Topics
• Structured data extraction
• Unstructured text extraction
• Sentiment classification, analysis and summarization of consumer reviews
• Information integration and schema matching
• Knowledge synthesis
• Template detection and page segmentation
31. Structured Data Extraction
• The most widely studied research topic
• A large amount of information on the Web is contained in regularly structured data objects (retrieved from databases)
• Such Web data records are important because they often present the essential information of their host pages, e.g., lists of products and services
32. Structured Data Extraction
• Applications: integrated and value-added services, e.g., comparative shopping, meta-search & query, etc.
34. Structured Data Extraction: Approaches
• Wrapper generation
  • Write an extraction program for each website based on observed format patterns
  • Labor intensive & time consuming
38. • Automatic approach
  • Structured data objects on the web are normally database records
  • Retrieved from databases & displayed in web pages with fixed templates
  • Find patterns / grammars from the web pages & then use them to extract data
  • e.g. IEPAD, MDR, ROADRUNNER, EXALG, etc.
39. • Wrapper induction (wrapper learning)
  • Currently the main technique
  • The user first manually labels a set of training pages
  • A learning system then generates rules from the training pages
  • The resulting rules are then applied to extract target items from web pages
  • e.g. WIEN, Stalker, BWI, WL, etc.
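The label-learn-apply loop above can be illustrated with a toy delimiter-based rule: learn the strings immediately left and right of a labeled value, then extract whatever sits between them on a new page. This is a drastic simplification, not the actual WIEN or Stalker algorithm, and the page snippets are made up.

```python
def learn_rule(page, target):
    """Learn left/right delimiter strings around one labeled target value.
    Real wrapper-induction systems generalize over many labeled pages."""
    i = page.index(target)
    left = page[max(0, i - 4):i]                     # 4-char landmark, arbitrary
    right = page[i + len(target):i + len(target) + 5]
    return left, right

def apply_rule(page, rule):
    """Extract the text between the learned delimiters on a new page."""
    left, right = rule
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]

train = "<b>Price:</b> $19.99<br>"
rule = learn_rule(train, "$19.99")
print(apply_rule("<b>Price:</b> $4.50<br>", rule))  # $4.50
```

The rule transfers to new pages only because the pages share a fixed template, which is precisely the assumption wrapper approaches rely on.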
40. • Supervised learning
  • Supervised learning is a machine learning technique for creating a function from training data
  • Documents are categorized (labeled)
  • The output can predict a class label for an input object (called classification)
  • Techniques used:
    • Nearest neighbor classifier
    • Feature selection
    • Decision trees
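A minimal nearest-neighbor text classifier, as a sketch of the first technique listed: assign a new document the label of the most similar training document. The bag-of-words representation, cosine similarity, and the two training documents are illustrative choices, not prescribed by the slides.

```python
from collections import Counter

def bow(text):
    """Term-frequency bag of words over whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def nearest_neighbor_classify(doc, training):
    """Return the label of the most similar (doc, label) training pair."""
    return max(training, key=lambda pair: cosine(bow(doc), bow(pair[0])))[1]

training = [
    ("stock market quotes finance", "finance"),
    ("football match score goal", "sports"),
]
print(nearest_neighbor_classify("latest finance quotes", training))  # finance
```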
41. • Removes terms in the training documents which are statistically uncorrelated with the class labels
• Simple heuristics:
  • Remove stop words like "a", "an", "the", etc.
  • Discard "too frequent" and "too rare" terms, using empirically chosen thresholds
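These heuristics can be sketched as follows; the stop-word list and the document-frequency thresholds are arbitrary illustrative values, not empirically tuned ones.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "of", "that", "and", "to"}

def select_features(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that are not stop words, appear in at least min_df
    documents ("too rare" filter), and in at most max_df_ratio of all
    documents ("too frequent" filter)."""
    df = Counter()
    for doc in docs:
        for term in set(doc.lower().split()):
            df[term] += 1                      # document frequency, not term count
    n = len(docs)
    return {t for t, c in df.items()
            if t not in STOP_WORDS and min_df <= c <= max_df_ratio * n}

docs = [
    "the web is huge",
    "web mining is useful",
    "web mining grows daily",
]
print(sorted(select_features(docs)))  # ['mining']
```

Here "web" is dropped as too frequent (it appears in every document), the singletons as too rare, and the stop words outright, leaving only "mining".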
42. Examples of Discovered Patterns
• Association rules
  • 98% of AOL users also have E-trade accounts
• Classification
  • People with age less than 40 and salary > 40K trade on-line
• Clustering
  • Users A and B access similar URLs
• Outlier detection
  • User A spends more than twice the average amount of time surfing on the Web
43. • Important for improving customization
  • Provide users with pages and advertisements of interest
  • Example profiles: on-line trader, on-line shopper
• Generate user profiles based on their access patterns
  • Cluster users based on frequently accessed URLs
  • Use a classifier to generate a profile for each cluster
• Engage Technologies
  • Tracks web traffic to create anonymous user profiles of Web surfers
  • Has profiles for more than 35 million anonymous users
44. • Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites
• Plenty of startups doing internet advertising
  • Doubleclick, AdForce, Flycast, AdKnowledge
• Internet advertising is probably the "hottest" web mining application today
45. • Scheme 1:
  • Manually associate a set of ads with each user profile
  • For each user, display an ad from the set based on the profile
• Scheme 2:
  • Automate the association between ads and users
  • Use ad click information to cluster users (each user is associated with the set of ads that he/she clicked on)
  • For each cluster, find the ads that occur most frequently in the cluster; these become the ads for the set of users in the cluster
46. Internet Advertising
• Use collaborative filtering (e.g. Likeminds, Firefly)
• Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.)
• Rij = rating of user Ui for ad Aj
• Problem: compute user Ui's rating for an unrated ad Aj
• [Figure: a partially filled user-ad ratings matrix over ads A1, A2, A3, with "?" marking the unrated entry]
47. Internet Advertising
• Key idea: user Ui's rating for ad Aj is set to Rkj, where Uk is the user whose ratings of ads are most similar to Ui's
• User Ui's rating for an ad Aj that has not been previously displayed to Ui is computed as follows:
  • Consider each user Uk who has rated ad Aj
  • Compute Dik, the distance between Ui's and Uk's ratings on common ads
  • Ui's rating for ad Aj = Rkj, where Uk is the user with the smallest Dik
• Display to Ui the ad Aj with the highest computed rating
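The rating-prediction procedure on this slide can be sketched directly: among users who rated Aj, pick the one closest to Ui on commonly rated ads and copy their rating. The Euclidean distance and the sample ratings are illustrative assumptions.

```python
def predict_rating(ratings, ui, aj):
    """Predict Ui's rating for ad Aj as Rkj of the user Uk with the
    smallest distance Dik over commonly rated ads."""
    best_user, best_dist = None, float("inf")
    for uk, r in ratings.items():
        if uk == ui or aj not in r:
            continue                      # only consider users who rated Aj
        common = (set(r) & set(ratings[ui])) - {aj}
        if not common:
            continue
        dik = sum((ratings[ui][a] - r[a]) ** 2 for a in common) ** 0.5
        if dik < best_dist:
            best_user, best_dist = uk, dik
    return ratings[best_user][aj] if best_user else None

ratings = {
    "U1": {"A1": 5, "A2": 1},           # U1 has not seen A3
    "U2": {"A1": 5, "A2": 2, "A3": 4},  # closest to U1 on A1, A2
    "U3": {"A1": 1, "A2": 5, "A3": 1},
}
print(predict_rating(ratings, "U1", "A3"))  # 4, copied from U2
```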
48. • With the growing popularity of E-commerce, systems to detect and prevent fraud on the Web become important
• Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought)
• If the buying pattern changes significantly, signal fraud
• HNC Software uses domain knowledge and neural networks for credit card fraud detection
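A toy version of the signature idea, assuming the signature is just the mean and standard deviation of past spending; the deviation threshold is arbitrary, and this is not HNC's actual method.

```python
def fraud_alert(signature, amount, threshold=3.0):
    """Flag a transaction whose amount deviates from the user's historical
    mean by more than `threshold` standard deviations."""
    mean, std = signature
    return abs(amount - mean) > threshold * std

signature = (50.0, 10.0)              # historical mean and std of amount spent
print(fraud_alert(signature, 55.0))   # False: within the normal range
print(fraud_alert(signature, 500.0))  # True: significant change, signal fraud
```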
49. • Given:
  • A set of images
• Find:
  • All images similar to a given image
  • All pairs of similar images
• Sample applications:
  • Medical diagnosis
  • Weather prediction
  • Web search engine for images
  • E-commerce
50. • QBIC, Virage, Photobook
  • Compute a feature signature for each image
    • QBIC uses color histograms
    • WBIIS, WALRUS use wavelets
  • Use a spatial index to retrieve the database image whose signature is closest to the query's signature
• WALRUS decomposes an image into regions
  • A single signature is stored for each region
  • Two images are considered similar if they have enough similar region pairs
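A simplified stand-in for QBIC-style color-histogram signatures: quantize pixel intensities into bins, normalize, and compare signatures by distance. The bin count, the grayscale pixel values, and the L1 distance are illustrative choices.

```python
def color_histogram(pixels, bins=4):
    """Quantize 0-255 intensity values into a normalized histogram signature."""
    hist = [0] * bins
    for p in pixels:
        hist[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [h / total for h in hist]

def histogram_distance(h1, h2):
    """L1 distance between two signatures; smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

dark  = color_histogram([10, 20, 30, 40])
dark2 = color_histogram([15, 25, 35, 45])
light = color_histogram([200, 210, 220, 230])
# The two dark images are closer to each other than to the light one.
print(histogram_distance(dark, dark2) < histogram_distance(dark, light))  # True
```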
52. • Today's search engines are plagued by problems:
  • The abundance problem (99% of the information is of no interest to 99% of the people)
  • Limited coverage of the Web (many internet sources are hidden behind search interfaces)
    • The largest crawlers cover < 18% of all web pages
  • Limited query interfaces, based on keyword-oriented search
  • Limited customization to individual users
53. • Today's search engines are plagued by problems:
  • The Web is highly dynamic
    • Lots of pages are added, removed, and updated every day
  • Very high dimensionality
54. • Use Web directories (or topic hierarchies)
  • Provide a hierarchical classification of documents (e.g., Yahoo!)
  • Searches performed in the context of a topic restrict the search to only a subset of web pages related to the topic
• [Figure: a topic hierarchy rooted at the Yahoo home page, with children such as Business, News, Recreation, and Science, and grandchildren such as Companies, Finance, Jobs, Travel, and Sports]
55. • In the Clever project, hyper-links between Web pages are taken into account when categorizing them
  • Uses a Bayesian classifier
  • Exploits knowledge of the classes of the immediate neighbors of the document to be classified
  • Shows that simply taking text from neighbors and using standard document classifiers to classify the page does not work
• Inktomi's Directory Engine uses "Concept Induction" to automatically categorize millions of documents
56. • Objective: to deliver content to users quickly and reliably
  • Traffic management
  • Fault management
• [Figure: a service provider network connecting routers and servers]
57. • While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three
• The result is frequent congestion at servers and on network links
  • During a major event (e.g., Princess Diana's death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
  • Olympic sites during the games
  • NASA sites close to launches and landings of shuttles
58. • Key ideas
  • Dynamically replicate/cache content at multiple sites within the network, closer to the user
  • Multiple paths between any pair of sites
  • Route user requests to the server closest to the user or to the least loaded server
  • Use the path with the least congested network links
  • Akamai, Inktomi
60. • Need to mine network and Web traffic to determine
  • What content to replicate?
  • Which servers should store replicas?
  • Which server should a user request be routed to?
  • What path should be used to route packets?
• Network design issues
  • Where to place servers?
  • Where to place routers?
  • Which routers should be connected by links?
• One can use association rules and sequential pattern mining algorithms to cache/prefetch replicas at a server
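A simple sketch of mining page-transition patterns from session logs to pick prefetch candidates: count which page most often follows each page. This is a stand-in for full sequential pattern mining, and the URLs are made up.

```python
from collections import Counter

def next_page_counts(sessions):
    """Count (page, next_page) transitions across all user sessions."""
    pairs = Counter()
    for session in sessions:
        for a, b in zip(session, session[1:]):
            pairs[(a, b)] += 1
    return pairs

def prefetch_candidate(pairs, page):
    """The most frequent successor of `page` is the prefetch candidate."""
    followers = {b: c for (a, b), c in pairs.items() if a == page}
    return max(followers, key=followers.get) if followers else None

sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/specs"],
    ["/home", "/news"],
]
pairs = next_page_counts(sessions)
print(prefetch_candidate(pairs, "/home"))  # /products, seen in 2 of 3 sessions
```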
61. • Fault management involves
  • Quickly identifying failed/congested servers and links in the network
  • Re-routing user requests and packets to avoid congested/down servers and links
• Need to analyze alarm and traffic data to carry out root cause analysis of faults
• Bayesian classifiers can be used to predict the root cause given a set of alarms
63. • Web data sets can be very large
  • Tens to hundreds of terabytes
• Cannot mine on a single server!
  • Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
  • Without breaking the bank!
65. • Pages contain information
• Links are "roads"
• How do people navigate the Internet? → Web usage mining (clickstream analysis)
• Information on navigation paths is available in log files
• Logs can be mined from a client or a server perspective
66. • Why analyze Website usage?
• Knowledge about how visitors use a Website could
  • Provide guidelines for web site reorganization; help prevent disorientation
  • Help designers place important information where visitors look for it
  • Enable pre-fetching and caching of web pages
  • Provide an adaptive Website (personalization)
• Questions which could be answered
  • What are the differences in usage and access patterns among users?
  • Which user behaviors change over time?
  • How do usage patterns change with quality of service (slow/fast)?
  • What is the distribution of network traffic over time?
69. • Analog: a Web log file analyser
• Gives basic statistics such as
  • number of hits
  • average hits per time period
  • which pages in your site are popular
  • who is visiting your site
  • which keywords users are searching for to reach you
  • what is being downloaded
• http://www.analog.cx/
73. • Content is, in general, semi-structured
• Example: a document record with fields
  • Title
  • Author
  • Publication_Date
  • Length
  • Category
  • Abstract
  • Content
74. • Many methods are designed to analyze structured data
• If we can represent documents by a set of attributes, we will be able to use existing data mining methods
• How to represent a document?
  • Vector-based representation (referred to as "bag of words" as it is invariant to permutations)
  • Use statistics to add a numerical dimension to unstructured text
75. • A document representation aims to capture what the document is about
• One possible approach:
  • Each entry describes a document
  • Attributes describe whether or not a term appears in the document
76. • Another approach:
  • Each entry describes a document
  • Attributes represent the frequency with which a term appears in the document
77. • Stop word removal: many words are not informative and thus irrelevant for document representation: the, and, a, an, is, of, that, …
• Stemming: reducing words to their root form (reduces dimensionality)
  • A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but would not be retrieved by a query with the keyword "fishing"
  • Different words that share the same word stem should be represented by that stem ("fish") instead of the actual word
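The two preprocessing steps above can be sketched with a toy suffix-stripper; a real system would use, e.g., the Porter stemmer, and both the stop-word list and the suffix rules here are illustrative.

```python
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}

def crude_stem(word):
    """A toy suffix-stripping stemmer: maps fishes/fisher/fishers/fishing
    toward the common stem 'fish'. Not the Porter algorithm."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Stop-word removal followed by stemming."""
    return [crude_stem(w) for w in text.lower().split()
            if w not in STOP_WORDS]

print(preprocess("the fisher is fishing and catches fishes"))
# ['fish', 'fish', 'catch', 'fish']
```

After preprocessing, all the morphological variants of "fish" collapse onto one feature, so a query for "fishing" now matches documents that only mention "fishes" or "fisher".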