SlideShare uma empresa Scribd logo
1 de 41
Data Mining Chapter 5
Web Data Mining

Introduction to Data Mining with Case Studies
Author: G. K. Gupta
Prentice Hall India, 2006.
Web Data Mining
Use of data mining techniques to automatically
discover interesting and potentially useful
information from Web documents and services.
Web mining may be divided into three categories:
1. Web content mining
2. Web structure mining
3. Web usage mining

December 2008

©GKGupta

2
Web details
• More than 20 billion pages in 2008
• Many more documents in databases accessible from
the Web
• More than 4m servers
• A total of perhaps 100 terabytes
• More than a million pages are added daily
• Several hundred gigabytes change every month
• Hyperlinks for navigation, endorsement, citation,
criticism or plain whim

December 2008

©GKGupta

3
Web Index Size in Pages
Number of
Pages in
Billions

Google

16

MSN
Search

7

Yahoo

50

Ask

Total Web is estimated to be
about 23B pages.

Search
Engine

4.2

Source
http://www.worldwidewebsize.com/
20 November 2008

December 2008

©GKGupta

4
Size of the Web
GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista
September 2003

December 2008

©GKGupta

5
Size Trends
GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista
September 2003

December 2008

©GKGupta

6
Web
• Some 80% of Web pages are in English
• About 30% of domains are in .com domain

December 2008

©GKGupta

7
Graph terminology
•Web is a graph – vertices and edges (V,E)
•Directed graph – directed edges (p,q)
•Undirected graph - undirected edges (p,q)
•Strongly connected component - a set of nodes such
that for any (u,v) there is a path from u to v
•Breadth first search
•Diameter of a graph
•Average distance of the graph

December 2008

©GKGupta

8
Graph terminology
•Breadth first search - layer 1 consists of all nodes
that are pointed by the root, layer k consists of all
nodes that are pointed by nodes on level k-1
•Diameter of a graph - maximum over all ordered
pairs (u,v) of the shortest path from u to v

December 2008

©GKGupta

9
Web size
•In-degree is number of links to a node
•Out-degree is the number of links from a node
•Fraction of pages with i in-links is proportional to 1/i 2.1
•With i out-links, it is 1/i2.72
i

23%

15%

3

10%

5%

4

5%

2%

5
©GKGupta

Outlinks

2

December 2008

Inlinks

3%

1%
10
Citations
Lotka’s Inverse-Square Law - Number of authors publishing
n papers is about 1/n2 of those with only one.
60% of all authors that make a single contribution.
Less than 1% publish 10 or more papers.
Most web pages are linked only to one other page (many
not linked to any). Number of pages with multiple in-links
declines quickly.
Rich get richer concept!
December 2008

©GKGupta

11
Web graph structure
Web may be considered to have four major components
• Central core – strongly connected component (SCC)
– pages that can reach one another along directed
links - about 30% of the Web
• IN group – can reach SCC but cannot be reached
from it - about 20%
• OUT group – can be reached from SCC but cannot
reach it - about 20%
December 2008

©GKGupta

12
Web graph structure
• Tendrils – cannot reach SCC and cannot be
reached by it - about 20%
• Unconnected – about 10%
The Web is hierarchical in nature. The Web has a
strong locality feature. Almost two thirds of all links
are to sites within the enterprise domain. Only onethird of the links are external. Higher percentage of
external links are broken. The distance between local
links tends to be quite small.
December 2008

©GKGupta

13
Web Size
• Diameter over 500
• Diameter of the central core SCC is about 30
• Probability that you can reach from any node a to
any b is 24%
• When you can, on average the path length is 16

December 2008

©GKGupta

14
Terminology
• Child (u) and parent (v) – obvious
• Bipartite graph – BG(T,I) is a graph whose node set
can be partitioned into two subsets T and I. Every
directed edge of BG joins a node in T to a node in I.
• A single page may be considered an example of BG
with T consisting of the page and I consisting of all its
children.

December 2008

©GKGupta

15
Terminology
• Dense BG – let p and q be nonzero integer variables
and tc and ic be the number of nodes in T and I. A DBG
is a BG where
•each node of T establishes an edge with at least p
(1≤p≤ic) nodes of I and
•at least q (1≤q≤tc) nodes of T establish an edge with
each node of I
• Complete BG or CBG – where p = ic and q = tc.

December 2008

©GKGupta

16
Web Content mining
Discovering useful information from contents of Web
pages.
Web content is very rich consisting of textual, image,
audio, video etc and metadata as well as hyperlinks.
The data may be unstructured (free text) or structured
(data from a database) or semi-structured (html)
although much of the Web is unstructured.

December 2008

©GKGupta

17
Web Content mining
Use of the search engines to find content in most
cases does not work well, posing an abundance
problem. Searching phrase “data mining” gives 2.6m
documents.
It provides no information about structure of content
that we are searching for and no information about
various categories of documents that are found.
Need more sophisticated tools for searching or
discovering Web content.
December 2008

©GKGupta

18
Similar Pages
Proliferation of similar or identical documents on the Web.
It has been found that almost 30% of all web pages are
very similar to other pages and about 22% are virtually
identical to other pages.
Many reasons for identical pages (and mirror sites), for
example, faster access or to reduce international
network traffic.
How to find similar documents?

December 2008

©GKGupta

19
Web Content
• New search algorithms
• Image searching
• Dealing with similar pages

December 2008

©GKGupta

20
Web Usage mining
Making sense of Web users' behaviour.
Based on data logs of users’ interactions of the Web
including referring pages and user identification. The
logs may be Web server logs, proxy server logs,
browser logs, etc.
Useful is finding user habits which can assist in
reorganizing a Web site so that high quality of service
may be provided.
December 2008

©GKGupta

21
Web Usage mining
Existing tools report the number of hits of Web pages
and where the hits came from. Although useful, the
information is not sufficient to learn user behaviour.
Tools providing further analysis of such information
are useful.
May involve usage pattern discovery e.g. the most
commonly traversed paths through a Web site or all
paths traversed through a Web site.

December 2008

©GKGupta

22
Web Usage mining
These patterns need to be interpreted, analyzed,
visualized and acted upon. One useful way to represent
this information might be graphs. The traversal
patterns provides us very useful information about the
logical structure of the Web site.
Association rule methodology may also be used in
discovering affinity between various Web pages.

December 2008

©GKGupta

23
Web data mining
• Early work by Kleinberg in 1997-98
• Links represent human judgement
• If the creator of p provides a link to q then it confers
some authority to q although that is not always true
• When search engine results are obtained should they
be ranked by number of in-links?

December 2008

©GKGupta

24
Web Structure mining
Discovering the link structure or model underlying
the Web.
The model is based on the topology of the hyperlinks.
This can help in discovering similarity between sites
or in discovering authority sites for a particular topic
or discipline or in discovering overview or survey sites
that point to many authority sites (called hubs).

December 2008

©GKGupta

25
Authority and Hub
• A hub is a page that has links to many authorities
• An authority is a page with good content on the
query topic and pointed by many hub pages, that is,
it is relevant and popular

December 2008

©GKGupta

26
Kleinberg’s algorithm
• Want all authority pages for say “data mining”
• Google returns 2.6m pages, may not even include
topics like clustering and classification
• We want s pages such that
1. s is relatively small
2. s is rich in relevant pages
3. s contains most of the strongest authorities

December 2008

©GKGupta

27
Kleinberg’s algorithm
• Conditions 1 and 2 (small and relevant) are satisfied
by selecting 100 or 200 most highly ranked from a
search engine
• Needs to use an algorithm to satisfy 3

December 2008

©GKGupta

28
Kleinberg’s algorithm
• Let R be the number of pages selected from search
result
•S=R
• For each page in S, do steps 3-5
• Let T be the set of all pages pointed by S
• Let F be the set of all pages pointed to S
• Let S := S + T + some or all of F
• Delete all links with the same domain name
• Return S

December 2008

©GKGupta

29
Kleinberg’s algorithm
• S is the base for a given query – hubs and authorities
are found from it
• One could order the pages in S by in-degrees but this
does not always work
• Authority pages should not only have large in-degrees
but also considerable overlap in the sets of pages that
point to them
• Thus we find hubs and authorities together

December 2008

©GKGupta

30
Kleinberg’s algorithm
• Hubs and authorities have a mutually reinforcing
relationship: a good hub points to many good
authorities; a good authority is pointed to by many
good hubs.
• We need a method of breaking this circularity

December 2008

©GKGupta

31
An iterative algorithm
• Let page p have an authority weight xp and a hub
weight yp
• Weights are normalized, squares of sum of each x
and y weights is 1
• If p points to many pages with large x weights, its y
weight is increased (call it O)
• If p is pointed to by many pages with large y values,
its x weight is increased (call it I)

December 2008

©GKGupta

32
An iterative algorithm
• On termination, the pages with largest weights are
the query authorities and hubs
Theorem: The sequences of x and y weights converge.
Proof: Let A be the adjacency matrix of G, that is, (i,j)
entry of A is 1 if (pi, pj) is an edge in G, 0 otherwise.
I and O operations may be written as
x  AT y
yAx
December 2008

©GKGupta

33
Kleinberg’s algorithm
•I and O operations may be written as
•x  AT y
•y  A x
•xk is the unit vector given by
•xk = (ATA)k-1x0
•yk is the unit vector given by
•yk = (AAT)y0
•Standard Linear algebra shows both converge.
December 2008

©GKGupta

34
Intuition
A page creator creates the page and displays his
interests by putting links to other pages of interest.
Multiple pages are created with similar interests. This
is captured by a DBG abstraction.
A CBG would extract a smaller set of potential
members although there is no guarantee such a core
will always exist in a community.

December 2008

©GKGupta

35
Intuition
A DBG abstraction is perhaps better than a CBG
abstraction for a community because a page-creator
would rarely put links to all the pages of interest in the
community.

December 2008

©GKGupta

36
Web Communities
• A Web community is a collection of Web pages that
have a common interest in a topic
• Many Web communities are independent of
geography or time zones
• How to find communities given a collection of
pages?
• It has been suggested that community core is a
complete bipartite graph (CBG) –can we use
Kleinberg’s algorithm?

December 2008

©GKGupta

37
Discovering communities
• Definition of community
A Web community is characterized by a collection of
pages that form a linkage pattern equal to a DBG(T,
I, α, β), where α, β are thresholds that specify linkage
density.

December 2008

©GKGupta

38
Discovering communities
• Assert that Web communities are characterized by a
collection of pages that form a linkage pattern equal to
DBG
• A seed set is given
• Apply a related page algorithm (e.g. Kleinberg) to each
seed, derive related pages

December 2008

©GKGupta

39
cocitation
• Citation analysis is study of links between
publications
• Cocitation measures the number of citations in
common between two documents
• Bibliographic coupling measures the number of
documents that cite both of the two documents

December 2008

©GKGupta

40
cocite
• Cocite is a set of pages is related, if there exist a set
of common children
• n pages are related under cocite if these pages have
common children at least equal to cocite-factor
• If a group of pages are related according to cocite
relationship, these pages form an appropriate CBG.

December 2008

©GKGupta

41

Mais conteúdo relacionado

Mais procurados

Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streamshktripathy
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolutionitnewsafrica
 
13. Query Processing in DBMS
13. Query Processing in DBMS13. Query Processing in DBMS
13. Query Processing in DBMSkoolkampus
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisDataminingTools Inc
 
5.2 mining time series data
5.2 mining time series data5.2 mining time series data
5.2 mining time series dataKrish_ver2
 
Web mining (structure mining)
Web mining (structure mining)Web mining (structure mining)
Web mining (structure mining)Amir Fahmideh
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streamsKrish_ver2
 
Association rule mining
Association rule miningAssociation rule mining
Association rule miningAcad
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data miningEr. Nawaraj Bhandari
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data miningZHAO Sam
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patternsKrish_ver2
 

Mais procurados (20)

Lecture6 introduction to data streams
Lecture6 introduction to data streamsLecture6 introduction to data streams
Lecture6 introduction to data streams
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Big Data Evolution
Big Data EvolutionBig Data Evolution
Big Data Evolution
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
Chapter 15
Chapter 15Chapter 15
Chapter 15
 
Web mining
Web mining Web mining
Web mining
 
13. Query Processing in DBMS
13. Query Processing in DBMS13. Query Processing in DBMS
13. Query Processing in DBMS
 
Data Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysisData Mining: Graph mining and social network analysis
Data Mining: Graph mining and social network analysis
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
5.2 mining time series data
5.2 mining time series data5.2 mining time series data
5.2 mining time series data
 
Graph mining ppt
Graph mining pptGraph mining ppt
Graph mining ppt
 
Web mining (structure mining)
Web mining (structure mining)Web mining (structure mining)
Web mining (structure mining)
 
Backpropagation algo
Backpropagation  algoBackpropagation  algo
Backpropagation algo
 
5.1 mining data streams
5.1 mining data streams5.1 mining data streams
5.1 mining data streams
 
Inverted index
Inverted indexInverted index
Inverted index
 
Association rule mining
Association rule miningAssociation rule mining
Association rule mining
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data mining
 
Clustering: Large Databases in data mining
Clustering: Large Databases in data miningClustering: Large Databases in data mining
Clustering: Large Databases in data mining
 
5.3 mining sequential patterns
5.3 mining sequential patterns5.3 mining sequential patterns
5.3 mining sequential patterns
 

Semelhante a Web Data Mining Techniques

Introduction to Web Mining and Spatial Data Mining
Introduction to Web Mining and Spatial Data MiningIntroduction to Web Mining and Spatial Data Mining
Introduction to Web Mining and Spatial Data MiningAarshDhokai
 
Page rank talk at NTU-EE
Page rank talk at NTU-EEPage rank talk at NTU-EE
Page rank talk at NTU-EEPing Yeh
 
Data Processing in Web Mining Structure by Hyperlinks and Pagerank
Data Processing in Web Mining Structure by Hyperlinks and PagerankData Processing in Web Mining Structure by Hyperlinks and Pagerank
Data Processing in Web Mining Structure by Hyperlinks and Pagerankijtsrd
 
Disc2013 keynote speakers
Disc2013 keynote speakersDisc2013 keynote speakers
Disc2013 keynote speakersHan Woo PARK
 
Mapping french open data actors on the web with common crawl
Mapping french open data actors on the web with common crawlMapping french open data actors on the web with common crawl
Mapping french open data actors on the web with common crawldata publica
 
Time critical user centred web design
Time critical user centred web designTime critical user centred web design
Time critical user centred web designAntony Groves
 
No Ki Magic: Managing Complex DITA Hyperdocuments
No Ki Magic: Managing Complex DITA HyperdocumentsNo Ki Magic: Managing Complex DITA Hyperdocuments
No Ki Magic: Managing Complex DITA HyperdocumentsContrext Solutions
 
Data mining and warehouse by dr D. R. Patil sir
Data mining and warehouse by dr D. R. Patil sirData mining and warehouse by dr D. R. Patil sir
Data mining and warehouse by dr D. R. Patil sirchaudharipruthvirajr
 
A survey of web metrics
A survey of web metricsA survey of web metrics
A survey of web metricsunyil96
 
Presentation ADEQUATe Project: Workshop on Quality Assessment and Improvement...
Presentation ADEQUATe Project: Workshop on Quality Assessment and Improvement...Presentation ADEQUATe Project: Workshop on Quality Assessment and Improvement...
Presentation ADEQUATe Project: Workshop on Quality Assessment and Improvement...Martin Kaltenböck
 

Semelhante a Web Data Mining Techniques (20)

Web mining
Web miningWeb mining
Web mining
 
Introduction to Web Mining and Spatial Data Mining
Introduction to Web Mining and Spatial Data MiningIntroduction to Web Mining and Spatial Data Mining
Introduction to Web Mining and Spatial Data Mining
 
Page rank talk at NTU-EE
Page rank talk at NTU-EEPage rank talk at NTU-EE
Page rank talk at NTU-EE
 
4.5 webminig
4.5 webminig 4.5 webminig
4.5 webminig
 
4.1 webminig
4.1 webminig 4.1 webminig
4.1 webminig
 
Sebastian Hellmann
Sebastian HellmannSebastian Hellmann
Sebastian Hellmann
 
Bigowl aitech
Bigowl aitechBigowl aitech
Bigowl aitech
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Data Processing in Web Mining Structure by Hyperlinks and Pagerank
Data Processing in Web Mining Structure by Hyperlinks and PagerankData Processing in Web Mining Structure by Hyperlinks and Pagerank
Data Processing in Web Mining Structure by Hyperlinks and Pagerank
 
Disc2013 keynote speakers
Disc2013 keynote speakersDisc2013 keynote speakers
Disc2013 keynote speakers
 
Mapping french open data actors on the web with common crawl
Mapping french open data actors on the web with common crawlMapping french open data actors on the web with common crawl
Mapping french open data actors on the web with common crawl
 
Time critical user centred web design
Time critical user centred web designTime critical user centred web design
Time critical user centred web design
 
Web mining
Web miningWeb mining
Web mining
 
No Ki Magic: Managing Complex DITA Hyperdocuments
No Ki Magic: Managing Complex DITA HyperdocumentsNo Ki Magic: Managing Complex DITA Hyperdocuments
No Ki Magic: Managing Complex DITA Hyperdocuments
 
Data mining and warehouse by dr D. R. Patil sir
Data mining and warehouse by dr D. R. Patil sirData mining and warehouse by dr D. R. Patil sir
Data mining and warehouse by dr D. R. Patil sir
 
A survey of web metrics
A survey of web metricsA survey of web metrics
A survey of web metrics
 
Gaurav web mining
Gaurav web miningGaurav web mining
Gaurav web mining
 
Presentation ADEQUATe Project: Workshop on Quality Assessment and Improvement...
Presentation ADEQUATe Project: Workshop on Quality Assessment and Improvement...Presentation ADEQUATe Project: Workshop on Quality Assessment and Improvement...
Presentation ADEQUATe Project: Workshop on Quality Assessment and Improvement...
 
Web mining
Web miningWeb mining
Web mining
 
Phd defense slides
Phd defense slidesPhd defense slides
Phd defense slides
 

Mais de Institute of Technology Telkom

Mais de Institute of Technology Telkom (20)

Econopysics
EconopysicsEconopysics
Econopysics
 
Science and religion 100622120615-phpapp01
Science and religion 100622120615-phpapp01Science and religion 100622120615-phpapp01
Science and religion 100622120615-phpapp01
 
Konvergensi sains dan_spiritualitas
Konvergensi sains dan_spiritualitasKonvergensi sains dan_spiritualitas
Konvergensi sains dan_spiritualitas
 
Matematika arah kiblat mikrajuddin abdullah 2017
Matematika arah kiblat   mikrajuddin abdullah 2017Matematika arah kiblat   mikrajuddin abdullah 2017
Matematika arah kiblat mikrajuddin abdullah 2017
 
Iau solar effects 2005
Iau solar effects 2005Iau solar effects 2005
Iau solar effects 2005
 
Hfmsilri2jun14
Hfmsilri2jun14Hfmsilri2jun14
Hfmsilri2jun14
 
Fisika komputasi
Fisika komputasiFisika komputasi
Fisika komputasi
 
Computer Aided Process Planning
Computer Aided Process PlanningComputer Aided Process Planning
Computer Aided Process Planning
 
Archimedes
ArchimedesArchimedes
Archimedes
 
Web and text
Web and textWeb and text
Web and text
 
Time series Forecasting using svm
Time series Forecasting using  svmTime series Forecasting using  svm
Time series Forecasting using svm
 
Timeseries forecasting
Timeseries forecastingTimeseries forecasting
Timeseries forecasting
 
Fuzzy logic
Fuzzy logicFuzzy logic
Fuzzy logic
 
World population 1950--2050
World population 1950--2050World population 1950--2050
World population 1950--2050
 
neural networks
 neural networks neural networks
neural networks
 
Artificial neural networks
Artificial neural networks Artificial neural networks
Artificial neural networks
 
002 ray modeling dynamic systems
002 ray modeling dynamic systems002 ray modeling dynamic systems
002 ray modeling dynamic systems
 
002 ray modeling dynamic systems
002 ray modeling dynamic systems002 ray modeling dynamic systems
002 ray modeling dynamic systems
 
System dynamics majors fair
System dynamics majors fairSystem dynamics majors fair
System dynamics majors fair
 
System dynamics math representation
System dynamics math representationSystem dynamics math representation
System dynamics math representation
 

Último

MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxAshokKarra1
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinojohnmickonozaleda
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 

Último (20)

MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Karra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptxKarra SKD Conference Presentation Revised.pptx
Karra SKD Conference Presentation Revised.pptx
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
FILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipinoFILIPINO PSYCHology sikolohiyang pilipino
FILIPINO PSYCHology sikolohiyang pilipino
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 

Web Data Mining Techniques

  • 1. Data Mining Chapter 5 Web Data Mining Introduction to Data Mining with Case Studies Author: G. K. Gupta Prentice Hall India, 2006.
  • 2. Web Data Mining Use of data mining techniques to automatically discover interesting and potentially useful information from Web documents and services. Web mining may be divided into three categories: 1. Web content mining 2. Web structure mining 3. Web usage mining December 2008 ©GKGupta 2
  • 3. Web details • More than 20 billion pages in 2008 • Many more documents in databases accessible from the Web • More than 4m servers • A total of perhaps 100 terabytes • More than a million pages are added daily • Several hundred gigabytes change every month • Hyperlinks for navigation, endorsement, citation, criticism or plain whim December 2008 ©GKGupta 3
  • 4. Web Index Size in Pages Number of Pages in Billions Google 16 MSN Search 7 Yahoo 50 Ask Total Web is estimated to be about 23B pages. Search Engine 4.2 Source http://www.worldwidewebsize.com/ 20 November 2008 December 2008 ©GKGupta 4
  • 5. Size of the Web GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista September 2003 December 2008 ©GKGupta 5
  • 6. Size Trends GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista September 2003 December 2008 ©GKGupta 6
  • 7. Web • Some 80% of Web pages are in English • About 30% of domains are in .com domain December 2008 ©GKGupta 7
  • 8. Graph terminology •Web is a graph – vertices and edges (V,E) •Directed graph – directed edges (p,q) •Undirected graph - undirected edges (p,q) •Strongly connected component - a set of nodes such that for any (u,v) there is a path from u to v •Breadth first search •Diameter of a graph •Average distance of the graph December 2008 ©GKGupta 8
  • 9. Graph terminology •Breadth first search - layer 1 consists of all nodes that are pointed by the root, layer k consists of all nodes that are pointed by nodes on level k-1 •Diameter of a graph - maximum over all ordered pairs (u,v) of the shortest path from u to v December 2008 ©GKGupta 9
  • 10. Web size •In-degree is number of links to a node •Out-degree is the number of links from a node •Fraction of pages with i in-links is proportional to 1/i 2.1 •With i out-links, it is 1/i2.72 i 23% 15% 3 10% 5% 4 5% 2% 5 ©GKGupta Outlinks 2 December 2008 Inlinks 3% 1% 10
  • 11. Citations Lotka’s Inverse-Square Law - Number of authors publishing n papers is about 1/n2 of those with only one. 60% of all authors that make a single contribution. Less than 1% publish 10 or more papers. Most web pages are linked only to one other page (many not linked to any). Number of pages with multiple in-links declines quickly. Rich get richer concept! December 2008 ©GKGupta 11
  • 12. Web graph structure Web may be considered to have four major components • Central core – strongly connected component (SCC) – pages that can reach one another along directed links - about 30% of the Web • IN group – can reach SCC but cannot be reached from it - about 20% • OUT group – can be reached from SCC but cannot reach it - about 20% December 2008 ©GKGupta 12
  • 13. Web graph structure • Tendrils – cannot reach SCC and cannot be reached by it - about 20% • Unconnected – about 10% The Web is hierarchical in nature. The Web has a strong locality feature. Almost two thirds of all links are to sites within the enterprise domain. Only onethird of the links are external. Higher percentage of external links are broken. The distance between local links tends to be quite small. December 2008 ©GKGupta 13
  • 14. Web Size • Diameter over 500 • Diameter of the central core SCC is about 30 • Probability that you can reach from any node a to any b is 24% • When you can, on average the path length is 16 December 2008 ©GKGupta 14
  • 15. Terminology • Child (u) and parent (v) – obvious • Bipartite graph – BG(T,I) is a graph whose node set can be partitioned into two subsets T and I. Every directed edge of BG joins a node in T to a node in I. • A single page may be considered an example of BG with T consisting of the page and I consisting of all its children. December 2008 ©GKGupta 15
  • 16. Terminology • Dense BG – let p and q be nonzero integer variables and tc and ic be the number of nodes in T and I. A DBG is a BG where •each node of T establishes an edge with at least p (1≤p≤ic) nodes of I and •at least q (1≤q≤tc) nodes of T establish an edge with each node of I • Complete BG or CBG – where p = ic and q = tc. December 2008 ©GKGupta 16
  • 17. Web Content mining Discovering useful information from contents of Web pages. Web content is very rich consisting of textual, image, audio, video etc and metadata as well as hyperlinks. The data may be unstructured (free text) or structured (data from a database) or semi-structured (html) although much of the Web is unstructured. December 2008 ©GKGupta 17
  • 18. Web Content mining Use of the search engines to find content in most cases does not work well, posing an abundance problem. Searching phrase “data mining” gives 2.6m documents. It provides no information about structure of content that we are searching for and no information about various categories of documents that are found. Need more sophisticated tools for searching or discovering Web content. December 2008 ©GKGupta 18
  • 19. Similar Pages Proliferation of similar or identical documents on the Web. It has been found that almost 30% of all web pages are very similar to other pages and about 22% are virtually identical to other pages. Many reasons for identical pages (and mirror sites), for example, faster access or to reduce international network traffic. How to find similar documents? December 2008 ©GKGupta 19
  • 20. Web Content • New search algorithms • Image searching • Dealing with similar pages December 2008 ©GKGupta 20
  • 21. Web Usage mining Making sense of Web users' behaviour. Based on data logs of users’ interactions of the Web including referring pages and user identification. The logs may be Web server logs, proxy server logs, browser logs, etc. Useful is finding user habits which can assist in reorganizing a Web site so that high quality of service may be provided. December 2008 ©GKGupta 21
  • 22. Web Usage mining Existing tools report the number of hits of Web pages and where the hits came from. Although useful, the information is not sufficient to learn user behaviour. Tools providing further analysis of such information are useful. May involve usage pattern discovery e.g. the most commonly traversed paths through a Web site or all paths traversed through a Web site. December 2008 ©GKGupta 22
  • 23. Web Usage mining These patterns need to be interpreted, analyzed, visualized and acted upon. One useful way to represent this information might be graphs. The traversal patterns provides us very useful information about the logical structure of the Web site. Association rule methodology may also be used in discovering affinity between various Web pages. December 2008 ©GKGupta 23
  • 24. Web data mining • Early work by Kleinberg in 1997-98 • Links represent human judgement • If the creator of p provides a link to q then it confers some authority to q although that is not always true • When search engine results are obtained should they be ranked by number of in-links? December 2008 ©GKGupta 24
  • 25. Web Structure mining Discovering the link structure or model underlying the Web. The model is based on the topology of the hyperlinks. This can help in discovering similarity between sites or in discovering authority sites for a particular topic or discipline or in discovering overview or survey sites that point to many authority sites (called hubs). December 2008 ©GKGupta 25
  • 26. Authority and Hub • A hub is a page that has links to many authorities • An authority is a page with good content on the query topic and pointed by many hub pages, that is, it is relevant and popular December 2008 ©GKGupta 26
  • 27. Kleinberg’s algorithm • Want all authority pages for say “data mining” • Google returns 2.6m pages, may not even include topics like clustering and classification • We want s pages such that 1. s is relatively small 2. s is rich in relevant pages 3. s contains most of the strongest authorities December 2008 ©GKGupta 27
  • 28. Kleinberg’s algorithm • Conditions 1 and 2 (small and relevant) are satisfied by selecting 100 or 200 most highly ranked from a search engine • Needs to use an algorithm to satisfy 3 December 2008 ©GKGupta 28
  • 29. Kleinberg’s algorithm • Let R be the number of pages selected from search result •S=R • For each page in S, do steps 3-5 • Let T be the set of all pages pointed by S • Let F be the set of all pages pointed to S • Let S := S + T + some or all of F • Delete all links with the same domain name • Return S December 2008 ©GKGupta 29
  • 30. Kleinberg’s algorithm • S is the base for a given query – hubs and authorities are found from it • One could order the pages in S by in-degrees but this does not always work • Authority pages should not only have large in-degrees but also considerable overlap in the sets of pages that point to them • Thus we find hubs and authorities together December 2008 ©GKGupta 30
  • 31. Kleinberg’s algorithm • Hubs and authorities have a mutually reinforcing relationship: a good hub points to many good authorities; a good authority is pointed to by many good hubs. • We need a method of breaking this circularity December 2008 ©GKGupta 31
  • 32. An iterative algorithm • Let page p have an authority weight xp and a hub weight yp • Weights are normalized, squares of sum of each x and y weights is 1 • If p points to many pages with large x weights, its y weight is increased (call it O) • If p is pointed to by many pages with large y values, its x weight is increased (call it I) December 2008 ©GKGupta 32
  • 33. An iterative algorithm • On termination, the pages with largest weights are the query authorities and hubs Theorem: The sequences of x and y weights converge. Proof: Let A be the adjacency matrix of G, that is, (i,j) entry of A is 1 if (pi, pj) is an edge in G, 0 otherwise. I and O operations may be written as x  AT y yAx December 2008 ©GKGupta 33
  • 34. Kleinberg’s algorithm •I and O operations may be written as •x  AT y •y  A x •xk is the unit vector given by •xk = (ATA)k-1x0 •yk is the unit vector given by •yk = (AAT)y0 •Standard Linear algebra shows both converge. December 2008 ©GKGupta 34
  • 35. Intuition A page creator creates the page and displays his interests by putting links to other pages of interest. Multiple pages are created with similar interests. This is captured by a DBG abstraction. A CBG would extract a smaller set of potential members although there is no guarantee such a core will always exist in a community. December 2008 ©GKGupta 35
  • 36. Intuition A DBG abstraction is perhaps better than a CBG abstraction for a community because a page-creator would rarely put links to all the pages of interest in the community. December 2008 ©GKGupta 36
  • 37. Web Communities • A Web community is a collection of Web pages that have a common interest in a topic • Many Web communities are independent of geography or time zones • How to find communities given a collection of pages? • It has been suggested that community core is a complete bipartite graph (CBG) –can we use Kleinberg’s algorithm? December 2008 ©GKGupta 37
  • 38. Discovering communities • Definition of community A Web community is characterized by a collection of pages that form a linkage pattern equal to a DBG(T, I, α, β), where α, β are thresholds that specify linkage density. December 2008 ©GKGupta 38
  • 39. Discovering communities • Assert that Web communities are characterized by a collection of pages that form a linkage pattern equal to DBG • A seed set is given • Apply a related page algorithm (e.g. Kleinberg) to each seed, derive related pages December 2008 ©GKGupta 39
  • 40. cocitation • Citation analysis is study of links between publications • Cocitation measures the number of citations in common between two documents • Bibliographic coupling measures the number of documents that cite both of the two documents December 2008 ©GKGupta 40
  • 41. cocite • Cocite is a set of pages is related, if there exist a set of common children • n pages are related under cocite if these pages have common children at least equal to cocite-factor • If a group of pages are related according to cocite relationship, these pages form an appropriate CBG. December 2008 ©GKGupta 41