Building an Intelligent Web: Theory & Practice

Building an Intelligent Web:
Theory and Practice
Pawan Lingras
Saint Mary’s University
Rajendra Akerkar
American University of Armenia and SIBER, India

Discipline

Mathematics and Statistics Management
Computer Science

Chapters 1 – 8 excluding
shaded portion related to
Research Graduate Research Graduate mathematics and
implementation.

Information Chapters 1 – 8 excluding Chapters 2, 4 – 8 excluding
Complete Book Web Mining shaded portion related to shaded portion related to
Retrieval
implementation. implementation.

Chapters 1, 2, 3, 7
and 8 Chapters 4 - 8

Create a list of words

Remove stop words

Stem words

Calculate frequency of each stemmed
word

Figure 2.1 Transforming text document to a weighted list of keywords
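
The four steps in Figure 2.1 can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the stop-word list is a hypothetical, deliberately tiny one, and `crude_stem` is a stand-in suffix stripper rather than a real stemming algorithm such as Porter's.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real systems use longer ones.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "for"}

def crude_stem(word):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def weighted_keywords(text):
    words = re.findall(r"[a-z]+", text.lower())        # 1. create a list of words
    words = [w for w in words if w not in STOP_WORDS]  # 2. remove stop words
    stems = [crude_stem(w) for w in words]             # 3. stem words
    return Counter(stems)                              # 4. frequency of each stem

keywords = weighted_keywords("Mining the web: web miners mined links and linked pages")
print(keywords.most_common())
```

The resulting counter is the weighted list of keywords: each stem is paired with the number of times it occurs in the document.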

Data Mining has emerged as one of the most exciting and dynamic
fields in computing science. The driving force for data mining is
the presence of petabyte-scale online archives that potentially
contain valuable bits of information hidden in them. Commercial
enterprises have been quick to recognize the value of this
concept; consequently, within the span of a few years, the
software market itself for data mining is expected to be in excess
of $10 billion. Data mining refers to a family of techniques used
to detect interesting nuggets of relationships/knowledge in data.
While the theoretical underpinnings of the field have been around
for quite some time (in the form of pattern recognition,
statistics, data analysis and machine learning), the practice and
use of these techniques have been largely ad-hoc. With the
availability of large databases to store, manage and assimilate
data, the new thrust of data mining lies at the intersection of
database systems, artificial intelligence and algorithms that
efficiently analyze data. The distributed nature of several
databases, their size and the high complexity of many techniques
present interesting computational challenges.

Figure 2.43 Relationship between precision and recall
(plot: recall on the horizontal axis, precision on the vertical axis, both from 0 to 1)
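
As a rough illustration of the trade-off in Figure 2.43, the sketch below computes precision and recall after each position of a ranked result list. The document identifiers and the relevance judgments are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical relevance judgments and a hypothetical ranked result list.
relevant = {"d1", "d2", "d3", "d4"}
ranking = ["d1", "d5", "d2", "d6", "d3", "d7", "d4", "d8"]

# One (precision, recall) point per cut-off k = 1 .. len(ranking).
curve = [precision_recall(ranking[:k], relevant) for k in range(1, len(ranking) + 1)]
for k, (p, r) in enumerate(curve, start=1):
    print(f"top {k}: precision={p:.2f} recall={r:.2f}")
```

Retrieving more documents can only keep recall the same or raise it, while precision typically falls as non-relevant documents enter the result set.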

Semantic Web
The layer language model
(Berners-Lee, 2001; Broekstra et al., 2001)

<h1>Student Service Centre</h1>

Welcome to the home page of the Student Service Centre.

The centre is located in the main building of the University.

You may visit us for assistance during working days.

<h2>Office hours</h2>

Mon to Thu 8am - 6pm<br>

Fri 8am - 2pm<p>

But note that centre is not open during the weeks of the

<a href=". . .">State Of Origin</a>.

Figure 3.2 Example of a Web page of a Student Service Centre

<organization>

<serviceOffered>Admission</serviceOffered>

<organizationName>Student Service Centre</organizationName>

<staff>

<director>John Roth</director>

<secretary>Penny Brenner</secretary>

</staff>

</organization>

Figure 3.3 Example of a Web page of a Student Service Centre
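
To show why the tagged version is easier for software to use than the HTML page, the following sketch reads the markup from Figure 3.3 with Python's standard `xml.etree.ElementTree` module. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The markup from Figure 3.3, reproduced as a string.
doc = """<organization>
  <serviceOffered>Admission</serviceOffered>
  <organizationName>Student Service Centre</organizationName>
  <staff>
    <director>John Roth</director>
    <secretary>Penny Brenner</secretary>
  </staff>
</organization>"""

root = ET.fromstring(doc)
name = root.findtext("organizationName")   # direct child of <organization>
director = root.findtext("staff/director")  # path through <staff>
staff = {member.tag: member.text for member in root.find("staff")}

print(name, "| director:", director)
print(staff)
```

Because each value is labelled by its element name, a program can ask for the director directly instead of guessing from page layout.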

Figure 3.4 Representing classes and instances (Noy et al., 2001)

[Figure: XML document tree. Root element college with location Innsbruck and three lecturer elements, each with a @name and course elements carrying a @title:
lecturer @name Edward Bunker: courses Algorithms, Computational Algebra
lecturer @name Daniela Frost: course Nonlinear Analysis
lecturer @name Sam Hoofer: courses Discrete Structures, Modern Algebra, Nonlinear Analysis]

Queries 1 and 2
[Figure: the same document tree]

Queries 3 and 4
[Figure: the same document tree]

<?xml version="1.0"?>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

xmlns:dc="http://purl.org/dc/elements/1.1/">

<rdf:Description rdf:about="">

<dc:title>

Building an Intelligent Web: Theory and Practice

</dc:title>

<dc:creator> Rajendra Akerkar and Pawan Lingras </dc:creator>

</rdf:Description>

</rdf:RDF>

Figure 3.26 Fragment of RDF
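
The fragment in Figure 3.26 states two properties (title and creator) of one resource. A minimal sketch of reading those statements as (subject, predicate, object) triples follows. It uses only the standard `xml.etree.ElementTree` module, so it handles this simple `rdf:Description` layout and is not a general RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

rdf_xml = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="">
    <dc:title>Building an Intelligent Web: Theory and Practice</dc:title>
    <dc:creator> Rajendra Akerkar and Pawan Lingras </dc:creator>
  </rdf:Description>
</rdf:RDF>"""

root = ET.fromstring(rdf_xml)
triples = []
for desc in root.findall(f"{{{RDF_NS}}}Description"):
    subject = desc.get(f"{{{RDF_NS}}}about")  # "" refers to the document itself
    for prop in desc:
        # ElementTree writes qualified names as "{namespace}local".
        triples.append((subject, prop.tag, prop.text.strip()))

for triple in triples:
    print(triple)
```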


<rdf:RDF

xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"

xmlns:my="http://www.myvehicle.com/vehicle-schema/">

<rdfs:Class rdf:about="#Vehicle"/>

<rdfs:Class rdf:about="#Car">

<rdfs:subClassOf rdf:resource="#Vehicle"/>

</rdfs:Class>

<rdf:Property rdf:about="#name">

<rdfs:domain rdf:resource="#Vehicle"/>

</rdf:Property>

<rdf:Description rdf:about="#Ford">

<rdf:type rdf:resource="#Car"/>

<my:name>Ford Icon</my:name>

</rdf:Description>

<my:Truck rdf:about="#Mitsubishi">

<my:name>Mitsubishi</my:name>

<my:carry rdf:resource="#Mitsubishi"/>

</my:Truck>

</rdf:RDF>

Figure 3.29 RDF/XML file for the automobile example
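
One use of the schema in Figure 3.29 is simple inference over `rdfs:subClassOf`: since Ford is a Car and Car is a subclass of Vehicle, Ford is also a Vehicle. The sketch below hard-codes the relevant statements read off the figure as plain tuples (namespaces and the `#` prefixes are dropped for brevity) and follows subclass links. It illustrates the idea only and is not an RDFS reasoner.

```python
# Statements read off Figure 3.29, written as (subject, predicate, object).
triples = {
    ("Car", "subClassOf", "Vehicle"),
    ("Ford", "type", "Car"),
    ("Mitsubishi", "type", "Truck"),
}

def superclasses(cls, triples):
    """All classes reachable from cls through subClassOf links."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == "subClassOf" and o not in found:
                found.add(o)
                stack.append(o)
    return found

def types_of(resource, triples):
    """Directly stated types plus the types implied by the class hierarchy."""
    direct = {o for s, p, o in triples if s == resource and p == "type"}
    inferred = set(direct)
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return inferred

print(types_of("Ford", triples))
print(types_of("Mitsubishi", triples))
```

The figure never declares Truck as a subclass of Vehicle, so nothing beyond Truck is inferred for the Mitsubishi resource.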

<topicMap id="tmrf"

xmlns = 'http://www.topicmaps.org/xtm/1.0/'

xmlns:xlink = 'http://www.w3.org/1999/xlink'>



.... here my topics and my associations go ...

</topicMap>

Figure 3.30 A Topic Map document
(Adapted from http://topicmaps.bond.edu.au/docs/6/1)

Classification and Association

Data Preparation

• Database Theory
• SQL
• Data Transformation
• http://www.ecn.purdue.edu/KDDCUP/data/

Classification
• Find a rule, a formula, or black box classifier for
organizing data into classes.
– Classify clients requesting loans into categories
based on the likelihood of repayment
– Classify customers into Big or Moderate Spenders
based on what they buy
– Classify the customers into loyal, semi-loyal,
infrequent based on the products they buy
• The classifier is developed from the data in the
training set
• The reliability of the classifier is evaluated using
the test set of data

Classification
• ID3 Algorithm
– Numerical Illustration
– Application to a Small E-commerce Dataset
• C4.5 for Experimentation
• Other approaches
– Neural Networks
– Fuzzy Classification
– Rough Set Theory
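
ID3 grows a decision tree by repeatedly splitting on the attribute with the highest information gain. The sketch below shows only that attribute-selection step, not the full recursive algorithm. The four customer records and the attribute names (`visits`, `coupon`, `spender`) are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training set: classify customers as big or moderate spenders.
rows = [
    {"visits": "many", "coupon": "yes", "spender": "big"},
    {"visits": "many", "coupon": "no", "spender": "big"},
    {"visits": "few", "coupon": "yes", "spender": "moderate"},
    {"visits": "few", "coupon": "no", "spender": "moderate"},
]

gains = {a: information_gain(rows, a, "spender") for a in ("visits", "coupon")}
best = max(gains, key=gains.get)
print(gains, "-> split on", best)
```

In this toy data `visits` separates the two classes perfectly, so it would be chosen for the root of the tree, while `coupon` carries no information about the class.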

Association
• Market basket analysis
– determine which things go together
• Transactions might reveal that
– customers who buy bananas also buy candles
– cheese and pickled onions seem to occur frequently
in a shopping cart
• Information can be used for
– arranging a physical shop or structuring the Web site
– running targeted advertising campaigns

Association

• Apriori Algorithm
• Demonstration for an E-commerce
Application
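
A compact sketch of the Apriori idea follows: frequent itemsets of size k are built only from frequent itemsets of size k-1, because any itemset containing an infrequent subset cannot itself be frequent. The baskets are hypothetical, and `min_support` is taken here to be an absolute transaction count rather than a fraction.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {item for t in transactions for item in t}
    current = count({frozenset([i]) for i in items})
    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical shopping baskets.
baskets = [
    {"cheese", "pickled onions", "bread"},
    {"cheese", "pickled onions"},
    {"cheese", "bread"},
    {"bananas", "candles"},
]
frequent = apriori(baskets, min_support=2)
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Association rules such as "cheese implies pickled onions" would then be derived from these frequent itemsets by comparing their support counts; that step is omitted here.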

Clustering
• Breaks a large database into different
subgroups or clusters
• Unlike classification there are no
predefined classes
• The clusters are put together on the basis
of similarity to each other
• The data miners determine whether the
clusters offer any useful insight


Statistical Methods

• k-means
– Numerical Example
– Implementation
• Data Preparation
• Clustering
• Other Methods
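
A minimal sketch of k-means (Lloyd's algorithm) on two-dimensional points is given below. To keep the run deterministic, the initial centroids are supplied by the caller instead of being chosen at random; the points are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points with caller-supplied initial centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visually separate groups of points.
points = [(1, 1), (1, 2), (2, 1), (4, 4), (4, 5), (5, 4)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (5, 4)])
print(centroids)
print(clusters)
```

No class labels are supplied; the groups emerge only from the distances between points, which is the difference from classification noted above.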

Neural Network Based Approaches

• Kohonen Self-Organising Maps
– Numerical Demonstration
– Application to Web Data Collection
• Other Neural Network Based Approaches

Web Mining

• Web Content Mining
– Web Page Content Mining
– Search Result Mining
• Web Structure Mining
• Web Usage Mining
– General Access Pattern Tracking
– Customized Usage Tracking

High level web usage mining process
(Srivastava et al., 2000)

Applications of web usage mining
(Romanko, 2006; Srivastava et al., 2000)

140.14.6.11 - pawan [06/Sep/2001:10:46:07 -0300] "GET /s.htm HTTP/1.0" 200 2267

140.14.7.18 - raj [06/Sep/2001:11:23:53 -0300] "POST /s.cgi HTTP/1.0" 200 499
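
The two log entries above appear to follow the Common Log Format (host, identity, user, timestamp, request line, status, bytes). As a sketch of the first data preparation step in web usage mining, the code below splits such lines into named fields; the regular expression is an assumption that fits these two lines and may need adjusting for other servers.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "method url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

lines = [
    '140.14.6.11 - pawan [06/Sep/2001:10:46:07 -0300] "GET /s.htm HTTP/1.0" 200 2267',
    '140.14.7.18 - raj [06/Sep/2001:11:23:53 -0300] "POST /s.cgi HTTP/1.0" 200 499',
]
entries = [parse_line(line) for line in lines]
for e in entries:
    print(e["user"], e["method"], e["url"], e["status"], e["size"])
```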

Classification exercise

Channel     Recall    Precision
Finance     44.3%     98.27%
Health      52.3%     89.66%
Market      49.1%     83.34%
News        44.1%     89.27%
Shopping    31.5%     91.31%
Specials    60.2%     92.86%
Sport       50.0%     91.93%
Surveys     21.9%     92.66%
Theatre     54.8%     94.63%

Table 6.8 Precision and recall for predicting user’s interest in channels
(Baglioni, et al., 2003)

Association exercise

News Section     Minimum Requests   Maximum Requests   Mean Requests   Standard Deviation
Science          1                  97                 2.3034          2.8184
Culture          1                  208                3.7878          5.9742
Sports           1                  318                5.6985          10.8360
Economics        1                  258                3.9335          7.2341
International    1                  208                3.3823          5.5540
Local Lisbon     1                  460                5.6883          11.5650
Local Port       1                  256                7.5984          13.2351
Politics         1                  208                3.3577          5.4101
Society          1                  367                4.2673          7.9853
Education        1                  90                 2.6496          3.29090

Table 6.9 Summary statistics of requests to the Publico on-line newspaper
(Batista and Silva, 2002)

The association mining showed strong associations between the following pairs:

• Politics and Society
• Politics and International News
• Politics and Sports
• Society and International News
• Society and Local Lisbon
• Society and Sports
• Society and Culture
• Sports and International News

Sequence Pattern Analysis of
Web Logs

Data Collection

• Web Crawlers
• Public Domain Web Crawlers
• An Implementation of a Web Crawler
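
The core loop of a crawler can be sketched as a breadth-first traversal: fetch a page, extract its links, and queue the ones not yet seen. The sketch below is an assumption-laden illustration, not the implementation the slides refer to. The `fetch` function is supplied by the caller, and a hypothetical in-memory set of pages under `http://example.org/` stands in for the network. A real crawler would also need robots.txt handling, politeness delays and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.org/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="b.html">B</a> <a href="/">home</a>',
    "http://example.org/b.html": '<a href="missing.html">gone</a>',
}
visited = crawl("http://example.org/", PAGES.get)
print(visited)
```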

Architecture of a search engine
(Romanko, 2006)

Other topics in Web Content Mining
• Search Engines
– How to prepare for and set up a search
engine
– Types and listings of search engines
(freeware, remote hosting services,
commercial)
• Multimedia Information Retrieval

0/10: The site or page is probably new.

3/10: The site is perhaps new, small in size and has very little or no worthwhile arriving links. The page gets very little traffic.

5/10: The site has a fair amount of worthwhile arriving links and traffic volume. The site might be larger in size and gets a good amount of steady traffic with some return visitors.

8/10: The site has many arriving links, probably from other high PageRank pages. The site perhaps contains a lot of information and has a higher traffic flow and return visitor rate.

10/10: The Web site is large, popular and has an extremely high number of links pointing to it.

http://www.iprcom.com/papers/pagerank/
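
The scores above are interpretations of a link-based ranking. As a rough sketch of how such a ranking can be computed, the code below runs the commonly published power-iteration form of PageRank on a small hypothetical link graph. The damping factor of 0.85 and the iteration count are assumptions, and the mapping from these raw values to a 0 to 10 scale is not given in the source, so it is not attempted here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page site; every link target is also a key.
links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home"],
    "orphan": ["home"],
}
rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.4f}")
```

The page with the most inbound links (`home`) ends up with the highest value, and the page nobody links to (`orphan`) keeps only the baseline share.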

Index quality for different search engines
(Henzinger, et al., 1999)

Index quality per page for different search engines

(Henzinger, et al., 1999)

Page                                             Freq. Walk 2   Freq. Walk 1   Rank Walk 1
www.microsoft.com/                               3172           1600           1
www.microsoft.com/windows/ie/default.htm         2064           1045           3
www.netscape.com/                                1991           876            6
www.microsoft.com/ie/                            1982           1017           4
www.microsoft.com/windows/ie/download/           1915           943            5
www.microsoft.com/windows/ie/download/all.htm    1696           830            7
www.adobe.com/prodindex/acrobat/readstep.html    1634           780            8
home.netscape.com/                               1581           695            10
www.linkexchange.com/                            1574           763            9
www.yahoo.com/                                   1527           1132           2

Table 8.2 Most frequently visited pages (Henzinger, et al., 1999)

Site                    Frequency Walk 2   Frequency Walk 1   Rank Walk 1
www.microsoft.com       32452              16917              1
home.netscape.com       23329              11084              2
www.adobe.com           10884              5539               3
www.amazon.com          10146              5182               4
www.netscape.com        4862               2307               10
excite.netscape.com     4714               2372               9
www.real.com            4494               2777               5
www.lycos.com           4448               2645               6
www.zdnet.com           4038               2562               8
www.linkexchange.com    3738               1940               12
www.yahoo.com           3461               2595               7

Table 8.3 Most frequently visited hosts (Henzinger, et al., 1999)
