Web-Content Mining
-Akanksha Dombe
JNEC, Aurangabad
Specifies
• The WWW is a huge, widely distributed, global information service centre for
  • Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
  • Hyper-link information
  • Access and usage information
• The WWW provides rich sources of data for data mining
The Web: Opportunities & Challenges
1. The amount of information on the Web is huge
2. The coverage of Web information is very wide and
diverse
3. Information/data of almost all types exist on the
Web
4. Much of the Web information is
semi-structured
5. Much of the Web information is linked
6. Much of the Web information is redundant
The Web: Opportunities & Challenges
7. The Web is noisy
8. The Web is also about services
9. The Web is dynamic
10. Above all, the Web is a virtual society
11. The Web consists of the surface Web and the deep Web.
  • Surface Web: pages that can be browsed using a browser.
  • Deep Web: databases that can only be accessed through parameterized query interfaces
What is Web Data?
• Web data is
1. Web content – text, image, records, etc.
2. Web structure – hyperlinks, tags, etc.
3. Web usage – HTTP logs, app server logs, etc.
4. Intra-page structures
5. Inter-page structures
6. Supplemental data
   1. Profiles
   2. Registration information
   3. Cookies
Web Mining
• Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services
• Web mining is the application of data mining techniques to find interesting and potentially useful knowledge from web data
• Web mining is the application of data mining techniques to extract knowledge from web data, including web documents, hyperlinks between documents, usage logs of web sites, etc.
Web Mining
• Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services
• Discovering useful information from the World-Wide Web and its usage patterns
• My definition: using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
Why Mine the Web?
• Enormous wealth of information on the Web
  • Financial information (e.g. stock quotes)
  • Book/CD/Video stores (e.g. Amazon)
  • Restaurant information
  • Car prices
• Lots of data on user access patterns
  • Web logs contain sequences of URLs accessed by users
• Possible to mine interesting nuggets of information
  • People who ski also travel frequently to Europe
  • Tech stocks have corrections in the summer and rally from November until February
Why is Web Mining Different?
• The Web is a huge collection of documents that, unlike a plain document collection, also carries
  • Hyper-link information
  • Access and usage information
• The Web is very dynamic
  • New pages are constantly being generated
• Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to
  • Exploit hyper-links and access patterns
  • Be incremental
Web Mining: Subtasks
• Resource finding
  • Retrieving intended documents
• Information selection/pre-processing
  • Select and pre-process specific information from selected documents
• Generalization
  • Discover general patterns within and across web sites
• Analysis
  • Validation and/or interpretation of mined patterns
Web Mining Issues
• Size
  • Grows at about 1 million pages a day
  • Google indexes 9 billion documents
• Number of web sites
  • Netcraft survey says 72 million sites
  • (http://news.netcraft.com/archives/web_server_survey.html)
• Diverse types of data
  • Images
  • Text
  • Audio/video
  • XML
  • HTML
Web Mining Applications
• E-commerce (Infrastructure)
  • Generate user profiles
  • Targeted advertising
  • Fraud
  • Similar image retrieval
• Information retrieval (Search) on the Web
  • Automated generation of topic hierarchies
  • Web knowledge bases
  • Extraction of schema for XML documents
• Network Management
  • Performance management
  • Fault management
Web Mining Taxonomy
Web Data Mining
• Use of data mining techniques to automatically discover interesting and potentially useful information from Web documents and services.
• Web mining may be divided into three categories:
1. Web content mining
2. Web structure mining
3. Web usage mining
What is “Web Content Mining”?
Web Content Mining
• Discovery of useful information from web contents / data / documents
• Web data contents:
1. text,
2. image,
3. audio,
4. video,
5. metadata and
6. hyperlinks
Web Content Mining
• Examines the contents of web pages as well as the results of web searching
• Can be thought of as extending the work performed by basic search engines
• Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users
• Web content mining is the process of extracting knowledge from web contents
Web Content Mining
• Basic keyword search provides no information about the structure of the content being searched for, nor about the categories of the documents that are found.
• More sophisticated tools are needed for searching or discovering Web content.
Web Content Mining
• Discovering useful information from the contents of Web pages.
• Web content is very rich, consisting of text, image, audio, video, etc., and metadata as well as hyperlinks.
• The data may be unstructured (free text), structured (data from a database) or semi-structured (HTML), although much of the Web is unstructured.
Web Content Data Structure
• Unstructured – free text
• Semi-structured – HTML
• More structured – table or database-generated HTML pages
• Multimedia data – receives less attention than text or hypertext
Web Content Mining
• Web content mining is related to data mining and text mining
  • It is related to data mining because many data mining techniques can be applied in Web content mining.
  • It is related to text mining because much of the web content is text.
  • Web data are mainly semi-structured and/or unstructured, whereas data mining deals mainly with structured data and text mining with unstructured text.
Web Content Data Structure
• Web content consists of several types of data
  • Text, image, audio, video, hyperlinks.
• Unstructured – free text
• Semi-structured – HTML
• More structured – data in tables or database-generated HTML pages
• Note: much of the Web content data is unstructured text data.
Semi-structured Data
• Content is, in general, semi-structured
• Example:
  • Title
  • Author
  • Publication_Date
  • Length
  • Category
  • Abstract
  • Content
Web Content Mining: IR View
• Unstructured Documents
  • Bag of words, or phrase-based feature representation
  • Features can be boolean or frequency based
  • Features can be reduced using different feature selection techniques
  • Word stemming, combining morphological variations into one feature
Web Content Mining: IR View
• Semi-Structured Documents
  • Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
  • Uses common data mining methods (whereas unstructured documents might use more text mining methods)
Web Content Mining: DB View
• Tries to infer the structure of a Web site or to transform a Web site into a database
  • Better information management
  • Better querying on the Web
• Can be achieved by:
  • Finding the schema of Web documents
  • Building a Web warehouse
  • Building a Web knowledge base
  • Building a virtual database
Web Content Mining: DB View
• Mainly uses the Object Exchange Model (OEM)
  • Represents semi-structured data (some structure, no rigid schema) by a labeled graph
• Process typically starts with manual selection of Web sites for content mining
• Main application: building a structural summary of semi-structured data (schema extraction or discovery)
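A rough sketch of the idea of a structural summary, under simplifying assumptions: nested Python dicts and lists stand in for an OEM-style labeled graph (a tree only, with no shared nodes or cycles), and `label_paths` is a hypothetical helper, not part of any OEM system. It collects the distinct label paths found in the data, which is one crude way to discover a schema where none was declared.

```python
def label_paths(node, prefix=()):
    """Collect every distinct path of edge labels reachable from node."""
    paths = set()
    if isinstance(node, dict):
        for label, child in node.items():
            path = prefix + (label,)
            paths.add(path)
            paths |= label_paths(child, path)
    elif isinstance(node, list):
        # A list models several outgoing edges that carry the same label.
        for child in node:
            paths |= label_paths(child, prefix)
    return paths


# Hypothetical semi-structured data: the two books share no rigid schema.
site = {
    "book": [
        {"title": "Web Mining", "author": "A. Author", "price": 30},
        {"title": "Data Mining", "year": 2001},
    ]
}

summary = sorted(".".join(path) for path in label_paths(site))
print(summary)
# ['book', 'book.author', 'book.price', 'book.title', 'book.year']
```

The summary lists every attribute that occurs somewhere, even though no single record has all of them, which is the sense in which the data has "some structure, no rigid schema".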
Techniques for Web Content Mining
• Classification
• Clustering
• Association
Web Content Mining: Topics
• Structured data extraction
• Unstructured text extraction
• Sentiment classification, analysis and summarization of consumer reviews
• Information integration and schema matching
• Knowledge synthesis
• Template detection and page segmentation
Structured Data Extraction
• Most widely studied research topic
• A large amount of information on the Web is contained in regularly structured data objects (retrieved from databases)
• Such Web data records are important because they often present the essential information of their host pages, e.g., lists of products and services
Structured Data Extraction
• Applications: integrated and value-added services, e.g., comparative shopping, meta-search and query, etc.
Structured Data Extraction: Approaches
1. Wrapper Generation
2. Wrapper Induction or Wrapper Learning
3. Automatic Approach
Structured Data Extraction: Approaches
• Wrapper Generation
  • Write an extraction program for each website based on observed format patterns
  • Labor intensive and time consuming
CS511, Bing Liu, UIC
• Automatic Approach
  • Structured data objects on the web are normally database records
  • Retrieved from databases and displayed in web pages with fixed templates
  • Find patterns / grammars from the web pages and then use them to extract data
  • e.g. IEPAD, MDR, ROADRUNNER, EXALG, etc.
• Wrapper Induction or Wrapper Learning
  • Main technique currently
  • The user first manually labels a set of training pages
  • A learning system then generates rules from the training pages
  • The resulting rules are then applied to extract target items from web pages
  • e.g. WIEN, Stalker, BWI, WL, etc.
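The label, learn, apply loop above can be sketched in a few lines. This is only an illustration in the spirit of left/right delimiter wrappers, not the actual algorithm of WIEN, Stalker or any other named system; the function names and the sample pages are hypothetical. The "rule" learned here is the longest text that always precedes the labeled item and the longest text that always follows it.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def learn_rule(labeled_pages):
    """Learn (left, right) delimiters from (page, labeled_target) pairs."""
    lefts, rights = [], []
    for page, target in labeled_pages:
        start = page.index(target)
        lefts.append(page[:start])
        rights.append(page[start + len(target):])
    left, right = common_suffix(lefts), os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("training pages share no usable delimiters")
    return left, right


def extract(page, rule):
    """Apply a learned rule to pull the target item out of a new page."""
    left, right = rule
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]


# Hypothetical training pages with the user-labeled target item.
training = [
    ("<h1>Shop A</h1><b>Price:</b> <i>$10</i><p>In stock</p>", "$10"),
    ("<h1>Shop B</h1><b>Price:</b> <i>$25</i><p>Sold out</p>", "$25"),
]
rule = learn_rule(training)
new_page = "<h1>Shop C</h1><b>Price:</b> <i>$99</i><p>In stock</p>"
print(rule)                     # ('</h1><b>Price:</b> <i>', '</i><p>')
print(extract(new_page, rule))  # $99
```

More labeled pages generally make the delimiters shorter and the rule more general; real systems learn far richer rule languages than a single pair of strings.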
• Supervised Learning
  • Supervised learning is a “machine learning” technique for creating a function from training data.
  • Documents are categorized
  • The output can predict a class label of the input object (called classification).
• Techniques used are
  • Nearest Neighbor Classifier
  • Feature Selection
  • Decision Tree
Feature Selection
• Removes terms in the training documents which are statistically uncorrelated with the class labels
• Simple heuristics
  • Remove stop words like “a”, “an”, “the”, etc.
  • Use empirically chosen thresholds for ignoring “too frequent” or “too rare” terms
  • Discard “too frequent” and “too rare” terms
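The simple heuristics can be sketched as a document-frequency filter. This covers only the stop-word and threshold heuristics, not statistical correlation with class labels; the function name `select_terms`, the threshold values and the sample documents are illustrative assumptions.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the"}  # the examples named on the slide


def select_terms(documents, min_df=2, max_df_ratio=0.9):
    """Keep terms that are neither stop words, too rare, nor too frequent.

    min_df and max_df_ratio are empirically chosen thresholds (assumed values).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))  # count each term once per document
    return {
        term
        for term, count in doc_freq.items()
        if term not in STOP_WORDS
        and count >= min_df
        and count / n_docs <= max_df_ratio
    }


docs = [
    "the web contains product pages",
    "the web contains news pages",
    "the web hosts product reviews",
    "the web hosts an archive",
]
kept = select_terms(docs)
print(sorted(kept))  # ['contains', 'hosts', 'pages', 'product']
```

Here “web” is dropped as too frequent (it occurs in every document), while “news”, “reviews” and “archive” are dropped as too rare.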
Examples of Discovered Patterns
• Association rules
  • 98% of AOL users also have E-trade accounts
• Classification
  • People with age less than 40 and salary > 40k trade on-line
• Clustering
  • Users A and B access similar URLs
• Outlier Detection
  • User A spends more than twice the average amount of time surfing on the Web
• Important for improving customization
  • Provide users with pages, advertisements of interest
  • Example profiles: on-line trader, on-line shopper
• Generate user profiles based on their access patterns
  • Cluster users based on frequently accessed URLs
  • Use classifier to generate a profile for each cluster
• Engage Technologies
  • Tracks web traffic to create anonymous user profiles of Web surfers
  • Has profiles for more than 35 million anonymous users
• Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites
• Plenty of startups doing internet advertising
  • Doubleclick, AdForce, Flycast, AdKnowledge
• Internet advertising is probably the “hottest” web mining application today
• Scheme 1:
  • Manually associate a set of ads with each user profile
  • For each user, display an ad from the set based on profile
• Scheme 2:
  • Automate association between ads and users
  • Use ad click information to cluster users (each user is associated with the set of ads that he/she clicked on)
  • For each cluster, find the ads that occur most frequently in the cluster; these become the ads for the set of users in the cluster
Internet Advertising
• Use collaborative filtering (e.g. Likeminds, Firefly)
• Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.)
  • Rij - rating of user Ui for ad Aj
• Problem: compute user Ui's rating for an unrated ad Aj
[Figure: ratings of ads A1, A2, A3, with the unrated entry marked “?”]
Internet Advertising
• Key idea: user Ui's rating for ad Aj is set to Rkj, where Uk is the user whose rating of ads is most similar to Ui's
• User Ui's rating for an ad Aj that has not been previously displayed to Ui is computed as follows:
  • Consider a user Uk who has rated ad Aj
  • Compute Dik, the distance between Ui's and Uk's ratings on common ads
  • Ui's rating for ad Aj = Rkj (Uk is the user with the smallest Dik)
• Display to Ui the ad Aj with the highest computed rating
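The nearest-neighbour rule above can be sketched as follows. The slides do not say how Dik is measured, so Euclidean distance over the commonly rated ads is an assumption here, as are the function names and the toy ratings.

```python
import math


def distance(ratings_a, ratings_b):
    """Dik: Euclidean distance over the ads both users have rated (assumed metric)."""
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return math.inf  # no basis for comparison
    return math.sqrt(sum((ratings_a[ad] - ratings_b[ad]) ** 2 for ad in common))


def predict_rating(ratings, user, ad):
    """Return Rkj of the user Uk who rated `ad` and is closest to `user`."""
    candidates = [
        (distance(ratings[user], ratings[other]), other)
        for other in ratings
        if other != user and ad in ratings[other]
    ]
    if not candidates:
        return None  # nobody has rated this ad yet
    _, nearest = min(candidates)
    return ratings[nearest][ad]


# Hypothetical ratings: U1 has not rated A3.
ratings = {
    "U1": {"A1": 5, "A2": 1},
    "U2": {"A1": 4, "A2": 1, "A3": 5},
    "U3": {"A1": 1, "A2": 5, "A3": 2},
}
predicted = predict_rating(ratings, "U1", "A3")
print(predicted)  # 5
```

U2's ratings on A1 and A2 are much closer to U1's than U3's are, so U1 inherits U2's rating of 5 for A3.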
• With the growing popularity of E-commerce, systems to detect and prevent fraud on the Web become important
• Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought)
• If the buying pattern changes significantly, then signal fraud
• HNC software uses domain knowledge and neural networks for credit card fraud detection
• Given:
  • A set of images
• Find:
  • All images similar to a given image
  • All pairs of similar images
• Sample applications:
  • Medical diagnosis
  • Weather prediction
  • Web search engine for images
  • E-commerce
• QBIC, Virage, Photobook
• Compute a feature signature for each image
  • QBIC uses color histograms
  • WBIIS, WALRUS use wavelets
• Use a spatial index to retrieve the database image whose signature is closest to the query's signature
• WALRUS decomposes an image into regions
  • A single signature is stored for each region
  • Two images are considered to be similar if they have enough similar region pairs
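A toy sketch of signature-based retrieval with color histograms. It is not how QBIC or WALRUS are implemented: images are assumed to be plain lists of (r, g, b) pixels, the signature is a coarse normalized histogram, the distance is L1, and a linear scan replaces the spatial index. All names and data are hypothetical.

```python
from collections import Counter


def color_histogram(pixels, levels=4):
    """Signature: normalized histogram over coarsely quantized (r, g, b) colors."""
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    return {color: n / len(pixels) for color, n in counts.items()}


def histogram_distance(h1, h2):
    """L1 distance between two sparse histograms."""
    return sum(abs(h1.get(c, 0.0) - h2.get(c, 0.0)) for c in h1.keys() | h2.keys())


def most_similar(query_pixels, database):
    """Linear scan for the database image whose signature is closest to the query's."""
    query = color_histogram(query_pixels)
    return min(
        database,
        key=lambda name: histogram_distance(query, color_histogram(database[name])),
    )


# Hypothetical four-pixel "images".
database = {
    "sunset": [(250, 120, 30)] * 3 + [(200, 60, 20)],
    "forest": [(20, 160, 40)] * 4,
}
query_image = [(245, 110, 25)] * 2 + [(30, 150, 50), (210, 70, 10)]
print(most_similar(query_image, database))  # sunset
```

In a real system the signatures would be computed once and stored in an index, so that a query does not have to touch every image.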
• Today's search engines are plagued by problems:
  • the abundance problem (99% of info of no interest to 99% of people)
  • limited coverage of the Web (internet sources hidden behind search interfaces)
    • Largest crawlers cover < 18% of all web pages
  • limited query interface based on keyword-oriented search
  • limited customization to individual users
• Today's search engines are plagued by problems:
  • Web is highly dynamic
    • Lots of pages are added, removed, and updated every day
  • Very high dimensionality
• Use Web directories (or topic hierarchies)
  • Provide a hierarchical classification of documents (e.g., Yahoo!)
  • Performing a search in the context of a topic restricts the search to only the subset of web pages related to that topic
[Figure: Yahoo home page topic hierarchy with top-level topics Recreation, Science, Business, News and subtopics Sports, Travel, Companies, Finance, Jobs]
• In the Clever project, hyper-links between Web pages are taken into account when categorizing them
  • Use a Bayesian classifier
  • Exploit knowledge of the classes of immediate neighbors of the document to be classified
  • Show that simply taking text from neighbors and using standard document classifiers to classify the page does not work
• Inktomi's Directory Engine uses “Concept Induction” to automatically categorize millions of documents
• Objective: to deliver content to users quickly and reliably
  • Traffic management
  • Fault management
[Figure: service provider network with routers and servers]
• While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three
• Result is frequent congestion at servers and on network links
  • During a major event (e.g., Princess Diana's death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
  • Olympic sites during the games
  • NASA sites close to launch and landing of shuttles
• Key Ideas
  • Dynamically replicate/cache content at multiple sites within the network and closer to the user
  • Multiple paths between any pair of sites
  • Route user requests to the server closest to the user or to the least loaded server
  • Use the path with the least congested network links
• Akamai, Inktomi
[Figure: service provider network with routers and servers, showing a request routed around a congested server and a congested link]
• Need to mine network and Web traffic to determine
  • What content to replicate?
  • Which servers should store replicas?
  • Which server to route a user request to?
  • What path to use to route packets?
• Network design issues
  • Where to place servers?
  • Where to place routers?
  • Which routers should be connected by links?
• One can use association rules and sequential pattern mining algorithms to cache/prefetch replicas at servers
• Fault management involves
  • Quickly identifying failed/congested servers and links in the network
  • Re-routing user requests and packets to avoid congested/down servers and links
• Need to analyze alarm and traffic data to carry out root cause analysis of faults
• Bayesian classifiers can be used to predict the root cause given a set of alarms
[Figure: Total Sites Across All Domains, August 1995 - October 2007]
• Web data sets can be very large
  • Tens to hundreds of terabytes
• Cannot mine on a single server!
  • Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
  • Without breaking the bank!
• Structured Data
• Unstructured Data
• OLE DB offers some solutions!
• Pages contain information
• Links are “roads”
• How do people navigate the Internet?
  • Answer: Web Usage Mining (clickstream analysis)
• Information on navigation paths is available in log files
• Logs can be mined from a client or a server perspective
• Why analyze Website usage?
• Knowledge about how visitors use a Website could
  • Provide guidelines for web site reorganization; help prevent disorientation
  • Help designers place important information where the visitors look for it
  • Support pre-fetching and caching of web pages
  • Provide an adaptive Website (personalization)
• Questions which could be answered
  • What are the differences in usage and access patterns among users?
  • What user behaviors change over time?
  • How do usage patterns change with quality of service (slow/fast)?
  • What is the distribution of network traffic over time?
• Analog – Web Log File Analyser
• Gives basic statistics such as
  • number of hits
  • average hits per time period
  • what are the popular pages in your site
  • who is visiting your site
  • what keywords are users searching for to get to you
  • what is being downloaded
• http://www.analog.cx/
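A few of these statistics can be computed directly from a server log. The sketch below is not Analog's implementation; it assumes log lines in the Common Log Format, and the regular expression, the `summarize` helper and the sample lines are illustrative.

```python
import re
from collections import Counter

# Assumed layout: Common Log Format (host ident user [time] "request" status size).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def summarize(lines):
    """Count hits per page and per visiting host, skipping unparseable lines."""
    pages, visitors = Counter(), Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue
        pages[match["path"]] += 1
        visitors[match["host"]] += 1
    return pages, visitors


# Hypothetical log excerpt.
sample_log = [
    '10.0.0.1 - - [10/Oct/2007:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2007:13:56:01 +0000] "GET /products.html HTTP/1.0" 200 5120',
    '10.0.0.1 - - [10/Oct/2007:13:57:12 +0000] "GET /index.html HTTP/1.0" 200 2326',
]
pages, visitors = summarize(sample_log)
print(sum(pages.values()))       # 3  (number of hits)
print(pages.most_common(1))      # [('/index.html', 2)]  (most popular page)
print(visitors.most_common(1))   # [('10.0.0.1', 2)]  (most frequent visitor)
```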
• Content is, in general, semi-structured
• Example:
  • Title
  • Author
  • Publication_Date
  • Length
  • Category
  • Abstract
  • Content
• Many methods are designed to analyze structured data
• If we can represent documents by a set of attributes, we will be able to use existing data mining methods
• How to represent a document?
  • Vector-based representation (referred to as “bag of words” as it is invariant to permutations)
• Use statistics to add a numerical dimension to unstructured text
• A document representation aims to capture what the document is about
• One possible approach:
  • Each entry describes a document
  • Attributes describe whether or not a term appears in the document
• Another approach:
  • Each entry describes a document
  • Attributes represent the frequency with which a term appears in the document
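Both approaches can be sketched with a shared vocabulary. Whitespace tokenization, the helper names and the sample documents are simplifying assumptions; the point is only that each document becomes one row of attributes.

```python
from collections import Counter


def build_vocabulary(documents):
    """One attribute per distinct term, in a fixed (sorted) order."""
    return sorted({term for doc in documents for term in doc.lower().split()})


def frequency_vector(document, vocabulary):
    """Attributes hold how often each term appears in the document."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


def boolean_vector(document, vocabulary):
    """Attributes hold whether or not each term appears in the document."""
    return [int(count > 0) for count in frequency_vector(document, vocabulary)]


docs = ["web mining mines web data", "data mining finds patterns"]
vocab = build_vocabulary(docs)
print(vocab)                             # ['data', 'finds', 'mines', 'mining', 'patterns', 'web']
print(frequency_vector(docs[0], vocab))  # [1, 0, 1, 1, 0, 2]
print(boolean_vector(docs[0], vocab))    # [1, 0, 1, 1, 0, 1]

# Word order is ignored: a permutation of the document gives the same vector.
print(frequency_vector("data web mines mining web", vocab) == frequency_vector(docs[0], vocab))  # True
```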
• Stop word removal: many words are not informative and thus irrelevant for document representation
  • the, and, a, an, is, of, that, ...
• Stemming: reducing words to their root form (reduces dimensionality)
  • A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but would not be retrieved by a query with the keyword “fishing”
  • Different words that share the same word stem should be represented by that stem (here “fish”) instead of the actual word
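A deliberately crude sketch of both steps. The suffix list and the minimum stem length are assumptions chosen to handle the “fish” example; a real system would use an established stemmer (such as the Porter stemmer) rather than this naive suffix stripping.

```python
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}  # list from the slide
SUFFIXES = ("ing", "ers", "er", "es", "s")  # assumed list, longer suffixes first


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop stop words, and stem what remains."""
    return [naive_stem(token) for token in text.lower().split() if token not in STOP_WORDS]


words = ["fish", "fishes", "fisher", "fishers", "fishing"]
print([naive_stem(w) for w in words])       # ['fish', 'fish', 'fish', 'fish', 'fish']
print(preprocess("The fisher is fishing"))  # ['fish', 'fish']
```

With every variant mapped to “fish”, a query for “fishing” is stemmed the same way and now matches the document.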

 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 

Web content mining

  • 2. Specifies
      • The WWW is a huge, widely distributed, global information service centre for
          • Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
          • Hyper-link information
          • Access and usage information
      • The WWW provides rich sources of data for data mining
  • 3. The Web: Opportunities & Challenges
      1. The amount of information on the Web is huge
      2. The coverage of Web information is very wide and diverse
      3. Information/data of almost all types exist on the Web
      4. Much of the Web information is semi-structured
      5. Much of the Web information is linked
      6. Much of the Web information is redundant
  • 4. The Web: Opportunities & Challenges
      7. The Web is noisy
      8. The Web is also about services
      9. The Web is dynamic
      10. Above all, the Web is a virtual society
      11. The Web consists of the surface Web and the deep Web
          • Surface Web: pages that can be browsed using a browser
          • Deep Web: databases that can only be accessed through parameterized query interfaces
  • 5. What is Web Data?
      • Web data is
          1. Web content: text, images, records, etc.
          2. Web structure: hyperlinks, tags, etc.
          3. Web usage: HTTP logs, app server logs, etc.
          4. Intra-page structures
          5. Inter-page structures
          6. Supplemental data
              1. Profiles
              2. Registration information
              3. Cookies
  • 7. Web Mining
      • Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services
      • Web mining is the application of data mining techniques to find interesting and potentially useful knowledge from web data
      • Web mining is the application of data mining techniques to extract knowledge from web data, including web documents, hyperlinks between documents, usage logs of web sites, etc.
  • 8. Web Mining
      • Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services
      • Discovering useful information from the World-Wide Web and its usage patterns
      • My definition: using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
  • 9. Why Mine the Web?
      • Enormous wealth of information on the Web
          • Financial information (e.g. stock quotes)
          • Book/CD/Video stores (e.g. Amazon)
          • Restaurant information
          • Car prices
      • Lots of data on user access patterns
          • Web logs contain the sequence of URLs accessed by users
      • Possible to mine interesting nuggets of information
          • People who ski also travel frequently to Europe
          • Tech stocks have corrections in the summer and rally from November until February
  • 10. Why is Web Mining Different?
      • The Web is a huge collection of documents, except that it also carries
          • Hyper-link information
          • Access and usage information
      • The Web is very dynamic
          • New pages are constantly being generated
      • Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to
          • Exploit hyper-links and access patterns
          • Be incremental
  • 11. Web Mining: Subtasks
      • Resource finding
          • Retrieving intended documents
      • Information selection/pre-processing
          • Select and pre-process specific information from selected documents
      • Generalization
          • Discover general patterns within and across web sites
      • Analysis
          • Validation and/or interpretation of mined patterns
  • 12. Web Mining Issues
      • Size
          • Grows at about 1 million pages a day
          • Google indexes 9 billion documents
      • Number of web sites
          • Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html)
      • Diverse types of data
          • Images
          • Text
          • Audio/video
          • XML
          • HTML
  • 13. Web Mining Applications
      • E-commerce (Infrastructure)
          • Generate user profiles
          • Targeted advertising
          • Fraud
          • Similar image retrieval
      • Information retrieval (Search) on the Web
          • Automated generation of topic hierarchies
          • Web knowledge bases
          • Extraction of schema for XML documents
      • Network Management
          • Performance management
          • Fault management
  • 15. Web Data Mining
      • Use of data mining techniques to automatically discover interesting and potentially useful information from Web documents and services
      • Web mining may be divided into three categories:
          1. Web content mining
          2. Web structure mining
          3. Web usage mining
  • 17. Web Content Mining
      • Discovery of useful information from web contents / data / documents
      • Web data contents:
          1. text
          2. image
          3. audio
          4. video
          5. metadata
          6. hyperlinks
  • 18. Web Content Mining
      • Examine the contents of web pages as well as the results of web searching
      • Can be thought of as extending the work performed by basic search engines
      • Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users
      • Web content mining is the process of extracting knowledge from web contents
  • 19. Web Content Mining
      • It provides no information about the structure of the content that we are searching for and no information about the various categories of documents that are found
      • Need more sophisticated tools for searching or discovering Web content
  • 20. Web Content Mining
      • Discovering useful information from the contents of Web pages
      • Web content is very rich, consisting of textual, image, audio, video, etc. data and metadata as well as hyperlinks
      • The data may be unstructured (free text), structured (data from a database) or semi-structured (HTML), although much of the Web is unstructured
  • 21. Web Content Data Structure
      • Unstructured: free text
      • Semi-structured: HTML
      • More structured: table or database-generated HTML pages
      • Multimedia data: receives less attention than text or hypertext
  • 22. Web Content Mining
      • Web content mining is related to data mining and text mining
      • It is related to data mining because many data mining techniques can be applied in Web content mining
      • It is related to text mining because much of the web content is text
      • Web data are mainly semi-structured and/or unstructured, while data mining deals mainly with structured data and text mining with unstructured text
  • 23. Web Content Data Structure
      • Web content consists of several types of data
          • Text, image, audio, video, hyperlinks
      • Unstructured: free text
      • Semi-structured: HTML
      • More structured: data in tables or database-generated HTML pages
      • Note: much of the Web content data is unstructured text data
  • 24. Semi-structured Data
      • Content is, in general, semi-structured
      • Example:
          • Title
          • Author
          • Publication_Date
          • Length
          • Category
          • Abstract
          • Content
  • 25. Web Content Mining: IR View
      • Unstructured documents
          • Bag of words, or phrase-based feature representation
          • Features can be boolean or frequency based
          • Features can be reduced using different feature selection techniques
          • Word stemming, combining morphological variations into one feature
  • 26. Web Content Mining: IR View
      • Semi-structured documents
          • Uses richer representations for features, based on information from the document structure (typically HTML and hyperlinks)
          • Uses common data mining methods (whereas unstructured might use more text mining methods)
  • 27. Web Content Mining: DB View
      • Tries to infer the structure of a Web site or transform a Web site to become a database
          • Better information management
          • Better querying on the Web
      • Can be achieved by:
          • Finding the schema of Web documents
          • Building a Web warehouse
          • Building a Web knowledge base
          • Building a virtual database
  • 28. Web Content Mining: DB View
      • Mainly uses the Object Exchange Model (OEM)
          • Represents semi-structured data (some structure, no rigid schema) by a labeled graph
      • Process typically starts with manual selection of Web sites for content mining
      • Main application: building a structural summary of semi-structured data (schema extraction or discovery)
  • 29. Techniques for Web Content Mining
      • Classification
      • Clustering
      • Association
  • 30. Web Content Mining: Topics
      • Structured data extraction
      • Unstructured text extraction
      • Sentiment classification, analysis and summarization of consumer reviews
      • Information integration and schema matching
      • Knowledge synthesis
      • Template detection and page segmentation
  • 31. Structured Data Extraction
      • Most widely studied research topic
      • A large amount of information on the Web is contained in regularly structured data objects (retrieved from databases)
      • Such Web data records are important because they often present the essential information of their host pages, e.g., lists of products and services
  • 32. Structured Data Extraction
      • Applications: integrated and value-added services, e.g., comparative shopping, meta-search & query, etc.
  • 33. Structured Data Extraction: Approaches
      1. Wrapper Generation
      2. Wrapper Induction or Wrapper Learning
      3. Automatic Approach
  • 34. Structured Data Extraction: Approaches
      • Wrapper Generation
          • Write an extraction program for each website based on observed format patterns
          • Labor intensive & time consuming
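A hand-written wrapper of the kind slide 34 describes can be sketched as a single regular expression tied to one site's observed markup. The page fragment, class names and field names below are invented for illustration; a real site needs its own pattern, which is why the approach is labor intensive.

```python
import re

# Hypothetical page fragment; the markup and field names are assumptions.
PAGE = """
<ul class="products">
  <li><span class="name">Laptop A</span> <span class="price">$899.00</span></li>
  <li><span class="name">Phone B</span> <span class="price">$499.50</span></li>
</ul>
"""

# Hand-written pattern tied to this one site's observed format.
RECORD = re.compile(
    r'<span class="name">(?P<name>.*?)</span>\s*'
    r'<span class="price">\$(?P<price>[\d.]+)</span>'
)


def extract_records(html):
    """Return one dict per product record found in the page."""
    return [
        {"name": m.group("name"), "price": float(m.group("price"))}
        for m in RECORD.finditer(html)
    ]


records = extract_records(PAGE)
print(records)
```

If the site changes its template, the pattern silently stops matching, so each wrapper has to be maintained by hand.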
  • 38. Automatic Approach
      • Structured data objects on the web are normally database records
      • Retrieved from databases & displayed in web pages with fixed templates
      • Find patterns / grammars from the web pages & then use them to extract data
      • e.g. IEPAD, MDR, ROADRUNNER, EXALG, etc.
  • 39. Wrapper Induction or Wrapper Learning
      • Main technique currently
      • The user first manually labels a set of training pages
      • A learning system then generates rules from the training pages
      • The resulting rules are then applied to extract target items from web pages
      • e.g. WIEN, Stalker, BWI, WL, etc.
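The label, learn and apply loop on slide 39 can be illustrated with a toy delimiter learner. This is a simplified sketch loosely in the spirit of delimiter-based wrappers, not the actual WIEN or Stalker algorithm; the pages and function names are made up for the example.

```python
import os


def learn_delimiters(examples):
    """Learn a (left, right) delimiter pair from labeled pages.

    examples: list of (page, value) pairs, where value is the manually
    labeled target string and occurs in page.
    """
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)
        # Reverse the text before the value so that a common suffix
        # becomes a common prefix.
        lefts.append(page[:start][::-1])
        rights.append(page[start + len(value):])
    left = os.path.commonprefix(lefts)[::-1]
    right = os.path.commonprefix(rights)
    return left, right


def apply_wrapper(page, left, right):
    """Extract the text between the learned delimiters."""
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]


# Hypothetical training pages with the price labeled by hand.
TRAINING = [
    ("<h1>Shop A</h1><b>Price:</b> <i>10</i><p>ok</p>", "10"),
    ("<h1>Store B</h1><b>Price:</b> <i>25</i><p>fine</p>", "25"),
]

left, right = learn_delimiters(TRAINING)
extracted = apply_wrapper("<h1>Mart C</h1><b>Price:</b> <i>42</i><p>new</p>", left, right)
print(left, right, extracted)
```

The sketch assumes the labeled value first appears at the labeled position and that the pages share a template; real systems handle multiple fields, repeated records and noisy labels.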
  • 40. Supervised Learning
      • Supervised learning is a ‘machine learning’ technique for creating a function from training data
      • Documents are categorized
      • The output can predict a class label of the input object (called classification)
      • Techniques used are
          • Nearest Neighbor Classifier
          • Feature Selection
          • Decision Tree
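As a minimal sketch of the nearest neighbor classifier named on slide 40, the code below labels a document by majority vote among its k most similar training documents. Cosine similarity over word counts, k = 3 and the tiny training set are all assumptions made for the example; the slides do not specify them.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a whitespace-tokenized document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_classify(query, training, k=3):
    """Predict a class label by majority vote of the k nearest documents."""
    q = vectorize(query)
    ranked = sorted(
        training, key=lambda pair: cosine(q, vectorize(pair[0])), reverse=True
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled training documents.
TRAINING = [
    ("stock market shares rally", "finance"),
    ("bank interest rates rise", "finance"),
    ("stock prices fall on market fears", "finance"),
    ("team wins football match", "sports"),
    ("player scores goal in final match", "sports"),
    ("coach praises team after match", "sports"),
]

print(knn_classify("market rally lifts stock prices", TRAINING))
print(knn_classify("football team scores goal", TRAINING))
```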
  • 41.
      • Removes terms in the training documents which are statistically uncorrelated with the class labels
      • Simple heuristics
          • Stop words like “a”, “an”, “the”, etc.
          • Empirically chosen thresholds for ignoring “too frequent” or “too rare” terms
          • Discard “too frequent” and “too rare” terms
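The simple heuristics on slide 41 can be sketched with document-frequency thresholds. The threshold values (`min_df`, `max_df_ratio`) and the sample documents are assumptions chosen for illustration, and the sketch covers only the heuristics, not a statistical test of correlation with class labels.

```python
from collections import Counter

STOP_WORDS = {"a", "an", "the"}


def select_terms(documents, min_df=2, max_df_ratio=0.8):
    """Keep terms that are neither stop words, too rare, nor too frequent.

    min_df: minimum number of documents a term must appear in.
    max_df_ratio: maximum fraction of documents a term may appear in.
    """
    n = len(documents)
    df = Counter()
    for doc in documents:
        # Count each term once per document (document frequency).
        df.update(set(doc.lower().split()))
    return {
        term
        for term, count in df.items()
        if term not in STOP_WORDS and count >= min_df and count / n <= max_df_ratio
    }


# Hypothetical training documents.
DOCS = [
    "the web page links",
    "a web crawler indexes the web",
    "web page ranking",
    "web usage logs",
    "the web crawler follows links",
]

print(sorted(select_terms(DOCS)))
```

Here "web" appears in every document and is dropped as too frequent, while terms that occur in a single document are dropped as too rare.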
  • 42. Examples of Discovered Patterns
      • Association rules
          • 98% of AOL users also have E-trade accounts
      • Classification
          • People with age less than 40 and salary > 40k trade on-line
      • Clustering
          • Users A and B access similar URLs
      • Outlier detection
          • User A spends more than twice the average amount of time surfing on the Web
  • 43.
      • Important for improving customization
          • Provide users with pages, advertisements of interest
          • Example profiles: on-line trader, on-line shopper
      • Generate user profiles based on their access patterns
          • Cluster users based on frequently accessed URLs
          • Use a classifier to generate a profile for each cluster
      • Engage Technologies
          • Tracks web traffic to create anonymous user profiles of Web surfers
          • Has profiles for more than 35 million anonymous users
  • 44.
      • Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites
      • Plenty of startups doing internet advertising
          • Doubleclick, AdForce, Flycast, AdKnowledge
      • Internet advertising is probably the “hottest” web mining application today
  • 45.
      • Scheme 1:
          • Manually associate a set of ads with each user profile
          • For each user, display an ad from the set based on profile
      • Scheme 2:
          • Automate the association between ads and users
          • Use ad click information to cluster users (each user is associated with the set of ads that he/she clicked on)
          • For each cluster, find the ads that occur most frequently in the cluster; these become the ads for the set of users in the cluster
  • 46. Internet Advertising
      • Use collaborative filtering (e.g. Likeminds, Firefly)
      • Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.)
          • Rij: rating of user Ui for ad Aj
      • Problem: compute user Ui's rating for an unrated ad Aj
      • [Figure: ratings for ads A1, A2, A3, with one unknown rating (?)]
  • 47. Internet Advertising
      • Key idea: user Ui's rating for ad Aj is set to Rkj, where Uk is the user whose rating of ads is most similar to Ui's
      • User Ui's rating for an ad Aj that has not been previously displayed to Ui is computed as follows:
          • Consider a user Uk who has rated ad Aj
          • Compute Dik, the distance between Ui's and Uk's ratings on common ads
          • Ui's rating for ad Aj = Rkj (Uk is the user with the smallest Dik)
      • Display to Ui the ad Aj with the highest computed rating
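The steps on slide 47 translate fairly directly into code. The slides do not say how the distance Dik is defined, so the sketch below assumes the mean absolute difference over commonly rated ads; the rating table and function names are hypothetical.

```python
def distance(r_i, r_k):
    """D_ik: mean absolute rating difference on ads both users rated.

    The choice of metric is an assumption. Returns None when the two
    users have no ads in common.
    """
    common = set(r_i) & set(r_k)
    if not common:
        return None
    return sum(abs(r_i[a] - r_k[a]) for a in common) / len(common)


def predict_rating(ratings, user, ad):
    """U_i's rating for A_j = R_kj, where U_k has the smallest D_ik."""
    best = None  # (distance, rating of the nearest user)
    for other, r_k in ratings.items():
        if other == user or ad not in r_k:
            continue  # only users who have rated the ad can contribute
        d = distance(ratings[user], r_k)
        if d is not None and (best is None or d < best[0]):
            best = (d, r_k[ad])
    return None if best is None else best[1]


def best_ad(ratings, user, candidate_ads):
    """Pick the not-yet-rated ad with the highest predicted rating."""
    scored = {}
    for ad in candidate_ads:
        if ad in ratings[user]:
            continue
        predicted = predict_rating(ratings, user, ad)
        if predicted is not None:
            scored[ad] = predicted
    return max(scored, key=scored.get) if scored else None


# Hypothetical ratings R_ij on a 1-5 scale.
RATINGS = {
    "U1": {"A1": 5, "A2": 1},
    "U2": {"A1": 5, "A2": 2, "A3": 4, "A4": 1},
    "U3": {"A1": 1, "A2": 5, "A3": 1, "A4": 5},
}

print(predict_rating(RATINGS, "U1", "A3"))
print(best_ad(RATINGS, "U1", ["A3", "A4"]))
```

U2's ratings are closest to U1's on the shared ads A1 and A2, so U1 inherits U2's ratings for the unseen ads.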
  • 48.
      • With the growing popularity of E-commerce, systems to detect and prevent fraud on the Web become important
      • Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought)
      • If the buying pattern changes significantly, then signal fraud
      • HNC software uses domain knowledge and neural networks for credit card fraud detection
  • 49.
      • Given:
          • A set of images
      • Find:
          • All images similar to a given image
          • All pairs of similar images
      • Sample applications:
          • Medical diagnosis
          • Weather prediction
          • Web search engine for images
          • E-commerce
  • 50.
      • QBIC, Virage, Photobook
      • Compute a feature signature for each image
          • QBIC uses color histograms
          • WBIIS, WALRUS use wavelets
      • Use a spatial index to retrieve the database image whose signature is closest to the query's signature
      • WALRUS decomposes an image into regions
          • A single signature is stored for each region
          • Two images are considered to be similar if they have enough similar region pairs
  • 52.
      • Today's search engines are plagued by problems:
          • The abundance problem (99% of info of no interest to 99% of people)
          • Limited coverage of the Web (internet sources hidden behind search interfaces)
              • Largest crawlers cover < 18% of all web pages
          • Limited query interface based on keyword-oriented search
          • Limited customization to individual users
  • 53.
      • Today's search engines are plagued by problems:
          • The Web is highly dynamic
              • Lots of pages are added, removed, and updated every day
          • Very high dimensionality
  • 54.
      • Use Web directories (or topic hierarchies)
          • Provide a hierarchical classification of documents (e.g., Yahoo!)
      • A search performed in the context of a topic is restricted to only the subset of web pages related to the topic
      • [Figure: example topic hierarchy under the Yahoo home page, with nodes Recreation, Science, Business, News, Sports, Travel, Companies, Finance, Jobs]
  • 55.
      • In the Clever project, hyper-links between Web pages are taken into account when categorizing them
          • Use a Bayesian classifier
          • Exploit knowledge of the classes of the immediate neighbors of the document to be classified
          • Show that simply taking text from neighbors and using standard document classifiers to classify the page does not work
      • Inktomi's Directory Engine uses “Concept Induction” to automatically categorize millions of documents
  • 56.
      • Objective: to deliver content to users quickly and reliably
          • Traffic management
          • Fault management
      • [Figure: service provider network with routers and servers]
  • 57.
      • While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three
      • The result is frequent congestion at servers and on network links
          • During a major event (e.g., Princess Diana's death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
          • Olympic sites during the games
          • NASA sites close to launch and landing of shuttles
  • 58.
      • Key ideas
          • Dynamically replicate/cache content at multiple sites within the network and closer to the user
          • Multiple paths between any pair of sites
          • Route user requests to the server closest to the user or the least loaded server
          • Use the path with the least congested network links
      • Akamai, Inktomi
  • 60.
      • Need to mine network and Web traffic to determine
          • What content to replicate?
          • Which servers should store replicas?
          • Which server to route a user request to?
          • What path to use to route packets?
      • Network design issues
          • Where to place servers?
          • Where to place routers?
          • Which routers should be connected by links?
      • One can use association rules and sequential pattern mining algorithms to cache/prefetch replicas at servers
  • 61.
      • Fault management involves
          • Quickly identifying failed/congested servers and links in the network
          • Re-routing user requests and packets to avoid congested/down servers and links
      • Need to analyze alarm and traffic data to carry out root cause analysis of faults
      • Bayesian classifiers can be used to predict the root cause given a set of alarms
  • 62. Total Sites Across All Domains August 1995 - October 2007
  • 63.
      • Web data sets can be very large
          • Tens to hundreds of terabytes
      • Cannot mine on a single server!
          • Need large farms of servers
      • How to organize hardware/software to mine multi-terabyte data sets
          • Without breaking the bank!
  • 64.
      • Structured data
      • Unstructured data
      • OLE DB offers some solutions!
  • 65.
      • Pages contain information
      • Links are 'roads'
      • How do people navigate the Internet?
          • → Web Usage Mining (clickstream analysis)
      • Information on navigation paths is available in log files
      • Logs can be mined from a client or a server perspective
  • 66.
      • Why analyze Website usage?
      • Knowledge about how visitors use a Website could
          • Provide guidelines for web site reorganization; help prevent disorientation
          • Help designers place important information where the visitors look for it
          • Support pre-fetching and caching of web pages
          • Provide an adaptive Website (personalization)
      • Questions which could be answered
          • What are the differences in usage and access patterns among users?
          • What user behaviors change over time?
          • How do usage patterns change with quality of service (slow/fast)?
          • What is the distribution of network traffic over time?
  • 69.
      • Analog: Web Log File Analyser
      • Gives basic statistics such as
          • number of hits
          • average hits per time period
          • what are the popular pages in your site
          • who is visiting your site
          • what keywords are users searching for to get to you
          • what is being downloaded
      • http://www.analog.cx/
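Statistics like those listed on slide 69 can be computed from a server access log in a few lines. This sketch is not how Analog itself works; it assumes log lines in the Common Log Format, and the sample entries are invented.

```python
import re
from collections import Counter

# Pattern for Common Log Format lines (an assumed input format).
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Hypothetical log entries.
SAMPLE_LOG = """\
10.0.0.1 - - [10/Oct/2007:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326
10.0.0.2 - - [10/Oct/2007:13:56:01 +0000] "GET /products.html HTTP/1.0" 200 5120
10.0.0.1 - - [10/Oct/2007:13:57:12 +0000] "GET /products.html HTTP/1.0" 200 5120
10.0.0.3 - - [10/Oct/2007:14:02:45 +0000] "GET /missing.html HTTP/1.0" 404 -
"""


def summarize(log_text):
    """Count hits, popular pages and distinct visiting hosts."""
    pages, hosts = Counter(), Counter()
    hits = 0
    for line in log_text.splitlines():
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not follow the assumed format
        hits += 1
        pages[m.group("path")] += 1
        hosts[m.group("host")] += 1
    return {
        "hits": hits,
        "top_pages": pages.most_common(2),
        "visitors": len(hosts),
    }


summary = summarize(SAMPLE_LOG)
print(summary)
```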
  • 73.
      • Content is, in general, semi-structured
      • Example:
          • Title
          • Author
          • Publication_Date
          • Length
          • Category
          • Abstract
          • Content
  • 74.
      • Many methods are designed to analyze structured data
      • If we can represent documents by a set of attributes, we will be able to use existing data mining methods
      • How to represent a document?
          • Vector-based representation (referred to as “bag of words” as it is invariant to permutations)
          • Use statistics to add a numerical dimension to unstructured text
  • 75.
      • A document representation aims to capture what the document is about
      • One possible approach:
          • Each entry describes a document
          • Attributes describe whether or not a term appears in the document
  • 76.
      • Another approach:
          • Each entry describes a document
          • Attributes represent the frequency with which a term appears in the document
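The two representations on slides 75 and 76 differ only in what each attribute stores. A minimal sketch, assuming whitespace tokenization and a vocabulary built from the documents themselves (both simplifications made for the example):

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of all distinct terms; each term becomes one attribute."""
    return sorted({term for doc in documents for term in doc.lower().split()})


def boolean_vector(doc, vocab):
    """1 if the term appears in the document, 0 otherwise (slide 75)."""
    terms = set(doc.lower().split())
    return [1 if term in terms else 0 for term in vocab]


def frequency_vector(doc, vocab):
    """Number of times each term appears in the document (slide 76)."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocab]


# Hypothetical documents.
DOCS = ["web mining mines web data", "data mining finds patterns"]
vocab = build_vocabulary(DOCS)

print(vocab)
print(boolean_vector(DOCS[0], vocab))
print(frequency_vector(DOCS[0], vocab))
```

Word order is discarded in both vectors, which is the sense in which the representation is invariant to permutations.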
  • 77.
      • Stop word removal: many words are not informative and thus irrelevant for document representation
          • the, and, a, an, is, of, that, …
      • Stemming: reducing words to their root form (reduces dimensionality)
          • A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but would not be retrieved by a query with the keyword “fishing”
          • Different words that share the same word stem should be represented by the stem (“fish”) instead of the actual word
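Both preprocessing steps on slide 77 can be sketched together. The suffix list below is a deliberately naive stand-in for a real stemmer such as Porter's, chosen only so that the slide's fish example works; it will mis-stem many English words.

```python
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}

# Naive suffix list, longest first; an assumption, not a real stemming algorithm.
SUFFIXES = ("ing", "ers", "er", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop stop words, and reduce remaining words to stems."""
    return [stem(token) for token in text.lower().split() if token not in STOP_WORDS]


print(preprocess("The fisher and the fishers is fishing of fishes"))
```

After preprocessing, a query for "fishing" and a document containing "fishers" both map to the stem "fish", so the document can be retrieved.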