Graph Mining - Motivation,
Applications and Algorithms
Graph mining seminar of Prof. Ehud Gudes
Fall 2008/9
Outline
• Introduction
• Motivation and applications of Graph Mining
• Mining Frequent Subgraphs – Transaction setting
– BFS/Apriori Approach (FSG and others)
– DFS Approach (gSpan and others)
– Greedy Approach

• Mining Frequent Subgraphs – Single graph setting
– The support issue
– The path-based algorithm
What is Data Mining?
Data Mining, also known as Knowledge Discovery
in Databases (KDD), is the process of extracting
useful hidden information from very large
databases in an unsupervised manner.
Mining Frequent Patterns:
What is it good for?
• Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
• Motivation: Finding inherent regularities in data
– What products were often purchased together?
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we classify web documents using frequent patterns?
The Apriori principle:
Downward closure Property
• All subsets of a frequent itemset must also be frequent
– because any transaction that contains X also contains every subset of X.

• If we have already verified that X is infrequent,
there is no need to count X's supersets, because they must be
infrequent too.
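
To make the pruning concrete, here is a minimal Python sketch of downward-closure pruning for itemsets (the representation and names are ours, not from the slides):

```python
from itertools import combinations

def prune_candidates(candidates, frequent_prev):
    """Keep only the k-itemsets whose (k-1)-subsets are all frequent."""
    kept = []
    for cand in candidates:
        k = len(cand)
        # downward closure: every (k-1)-subset must already be frequent
        if all(frozenset(sub) in frequent_prev
               for sub in combinations(cand, k - 1)):
            kept.append(cand)
    return kept

# toy usage: {a,b} and {b,c} are frequent, {a,c} is not,
# so the 3-candidate {a,b,c} is pruned without ever counting it
frequent_2 = {frozenset("ab"), frozenset("bc")}
print(prune_candidates([frozenset("abc")], frequent_2))  # -> []
```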
Outline
• Introduction
• Motivation and applications of Graph Mining
• Mining Frequent Subgraphs – Transaction setting
– BFS/Apriori Approach (FSG and others)
– DFS Approach (gSpan and others)
– Greedy Approach

• Mining Frequent Subgraphs – Single graph setting
– The support issue
– Path mining algorithm
What Graphs are good for?
• Most existing data mining algorithms are based on a
transaction representation, i.e., sets of items.
• Datasets with structure, layers, hierarchy and/or
geometry often do not fit well in this transaction
setting, e.g.:
– Numerical simulations
– 3D protein structures
– Chemical compounds
– Generic XML files
Graph Based Data Mining
• Graph Mining is the problem of discovering repetitive
subgraphs occurring in the input graphs.
• Motivation:
– finding subgraphs capable of compressing the data by
abstracting instances of the substructures.
– identifying conceptually interesting patterns
Why Graph Mining?
• Graphs are everywhere
– Chemical compounds (Cheminformatics)
– Protein structures, biological pathways/networks (Bioinformatics)
– Program control flow, traffic flow, and workflow analysis
– XML databases, Web, and social network analysis

• Graph is a general model
– Trees, lattices, sequences, and items are degenerated graphs

• Diversity of graphs
– Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with
angles & geometry (topological vs. 2-D/3-D)

• Complexity of algorithms: many problems are of high complexity (NP-complete or even PSPACE!)
Graphs, Graphs, Everywhere
[Figure: aspirin, the Internet, a yeast protein interaction network, and a co-author network — from H. Jeong et al., Nature 411, 41 (2001)]
Modeling Data With Graphs…
Going Beyond Transactions
Graphs are suitable for capturing arbitrary relations between the various elements:
– Element → Vertex
– Element's Attributes → Vertex Label
– Relation Between Two Elements → Edge
– Type Of Relation → Edge Label
– Relation Between a Set of Elements → Hyper Edge
Graphs provide enormous flexibility for modeling the underlying data, as they allow the modeler to decide what the elements should be and the type of relations to be modeled.
Graph Pattern Mining
• Frequent subgraphs
– A (sub)graph is frequent if its support (occurrence
frequency) in a given dataset is no less than a minimum
support threshold
• Applications of graph pattern mining:
– Mining biochemical structures
– Program control flow analysis
– Mining XML structures or Web communities
– Building blocks for graph classification, clustering,
compression, comparison, and correlation analysis
Example 1
[Figure: a graph dataset of transactions (T1), (T2), (T3) and its frequent patterns (1) and (2) at minimum support 2]
Example 2
[Figure: another graph dataset and its frequent patterns at minimum support 2]
Graph Mining Algorithms
• Simple path patterns (Chen, Park, Yu 98)
• Generalized path patterns (Nanopoulos, Manolopoulos 01)
• Simple tree patterns (Lin, Liu, Zhang, Zhou 98)
• Tree-like patterns (Wang, Huiqing, Liu 98)
• General graph patterns (Kuramochi, Karypis 01, Han 02, etc.)
Graph mining methods
• Apriori-based approach
• Pattern-growth approach
Apriori-Based Approach
[Figure: two frequent k-graphs G' and G'' sharing a common core G are joined to generate (k+1)-graph candidates G1, G2, …, Gn]
Pattern Growth Method
[Figure: a k-graph G is extended one edge at a time into (k+1)- and (k+2)-graphs G1, G2, …, Gn; the same subgraph can be reached along different growth paths (duplicate graphs)]
Outline
• Introduction
• Motivation and applications of Graph Mining
• Mining Frequent Subgraphs – Transaction setting
– BFS/Apriori Approach (FSG and others)
– DFS Approach (gSpan and others)
– Greedy Approach

• Mining Frequent Subgraphs – Single graph setting
– The support issue
– Path mining algorithm
– Constraint-based mining
Transaction Setting
Input: (D, minSup)
– A set of labeled-graph transactions D = {T1, T2, …, TN}
– A minimum support minSup
Output: all frequent subgraphs.
– A subgraph is frequent if it is a subgraph of at least minSup⋅|D| (or #minSup) different transactions in D.
– Each frequent subgraph is connected.

Single graph setting
Input: (D, minSup)
– A single graph D (e.g. the Web, DBLP, or an XML file)
– A minimum support minSup
Output: all frequent subgraphs.
– A subgraph is frequent if the number of its occurrences in D is above an admissible support measure (a measure that satisfies the downward closure property).
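
The transaction-setting support count is easy to sketch: a transaction contributes at most 1, no matter how many embeddings it contains. Below is a minimal illustration using networkx (our choice, not the slides'); note that networkx tests induced-subgraph isomorphism, a slight variation on the containment used in the definition above:

```python
import networkx as nx
from networkx.algorithms import isomorphism as iso

def transaction_support(pattern, transactions):
    """Transaction-setting support: a transaction counts at most once,
    however many embeddings of the pattern it contains."""
    node_match = iso.categorical_node_match("label", None)
    hits = 0
    for t in transactions:
        gm = iso.GraphMatcher(t, pattern, node_match=node_match)
        if gm.subgraph_is_isomorphic():
            hits += 1
    return hits / len(transactions)

# toy usage: one transaction a-b-a, one pattern edge a-b
t1 = nx.Graph(); t1.add_edges_from([(0, 1), (1, 2)])
nx.set_node_attributes(t1, {0: "a", 1: "b", 2: "a"}, "label")
p = nx.Graph(); p.add_edge(0, 1)
nx.set_node_attributes(p, {0: "a", 1: "b"}, "label")
print(transaction_support(p, [t1]))  # 1.0
```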
Graph Mining: Transaction Setting
Finding Frequent Subgraphs: Input and Output
Input: graph transactions
– A database of graph transactions.
– Undirected simple graphs (no loops, no multiple edges).
– Each graph transaction has labels associated with its vertices and edges.
– Transactions may not be connected.
– A minimum support threshold σ.
Output: frequent connected subgraphs
– Frequent subgraphs that satisfy the minimum support constraint.
– Each frequent subgraph is connected.
[Figure: three example transactions and their frequent connected subgraphs with support 100%, 66%, and 66%]
Different Approaches for GM
• Apriori Approach
– FSG
– Path Based

• DFS Approach
– gSpan

• Greedy Approach
– Subdue
FSG Algorithm
[M. Kuramochi and G. Karypis. Frequent subgraph discovery. ICDM 2001]

Notation: a k-subgraph is a subgraph with k edges.
Init: Scan the transactions to find F1 and F2, the sets of all frequent 1-subgraphs and 2-subgraphs, together with their counts;
For (k = 3; Fk-1 ≠ ∅; k++)
1. Candidate generation – Ck, the set of candidate k-subgraphs, from Fk-1, the set of frequent (k-1)-subgraphs;
2. Candidate pruning – a necessary condition for a candidate to be frequent is that each of its (k-1)-subgraphs is frequent;
3. Frequency counting – scan the transactions to count the occurrences of the subgraphs in Ck;
4. Fk = { c ∈ Ck | c has count no less than #minSup }
5. Return F1 ∪ F2 ∪ … ∪ Fk (= F)
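
The level-wise loop itself is simple; the hard parts (isomorphism-aware join, pruning, counting) are what the next slides discuss. To keep a sketch runnable we cheat: each graph is a set of uniquely labeled edges, so set identity replaces graph isomorphism and connectivity checks are skipped — the real FSG does both:

```python
from itertools import combinations

def fsg_skeleton(transactions, min_count):
    """BFS/Apriori-style skeleton of FSG over sets of labeled edges."""
    def support(cand):
        return sum(1 for t in transactions if cand <= t)
    edges = {e for t in transactions for e in t}
    # F[k] = frequent k-edge patterns
    F = {1: {frozenset([e]) for e in edges
             if support(frozenset([e])) >= min_count}}
    k = 2
    while F[k - 1]:
        # join: two (k-1)-patterns that overlap on a (k-2)-edge core
        candidates = {a | b for a, b in combinations(F[k - 1], 2)
                      if len(a | b) == k}
        # prune: every (k-1)-sub-pattern must be frequent
        candidates = {c for c in candidates
                      if all(c - {e} in F[k - 1] for e in c)}
        F[k] = {c for c in candidates if support(c) >= min_count}
        k += 1
    return [p for level in F.values() for p in level]

# toy run, edges named by their endpoint labels
T = [frozenset({"ab", "bc", "cd"}), frozenset({"ab", "bc"}), frozenset({"cd"})]
print(fsg_skeleton(T, 2))  # {'ab'}, {'bc'}, {'cd'}, {'ab','bc'} (order may vary)
```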
Trivial operations become complicated with graphs
• Candidate generation
– To determine whether two candidates can be joined, we need to check
for graph isomorphism.

• Candidate pruning
– To check downward closure property, we need graph
isomorphism.

• Frequency counting
– Subgraph isomorphism for checking containment of a
frequent subgraph.
Candidate generation (join) based on core detection
[Figure: two frequent (k-1)-subgraphs that share a common core are joined; the join can produce several different candidates]
Candidate Generation Based on Core Detection (cont.)
[Figure: multiple cores — a first core and a second core — may exist between two (k-1)-subgraphs, so the same pair can be joined in more than one way]
Candidate pruning: downward closure property
• Every (k-1)-subgraph must be frequent.
• For all the (k-1)-subgraphs of a given k-candidate, check whether the downward closure property holds.
[Figure: the candidate lattice — frequent 1-subgraphs → frequent 2-subgraphs → 3-candidates → frequent 3-subgraphs → 4-candidates → frequent 4-subgraphs → …]
Computational Challenges
Simple operations become complicated & expensive when dealing with graphs…

• Candidate generation
– To determine if we can join two candidates, we need to perform subgraph
isomorphism to determine if they have a common subgraph.
– There is no obvious way to reduce the number of times that we generate the same
subgraph.
– Need to perform graph isomorphism for redundancy checks.
– The joining of two frequent subgraphs can lead to multiple candidate subgraphs.

• Candidate pruning
– To check downward closure property, we need subgraph isomorphism.

• Frequency counting
– Subgraph isomorphism for checking containment of a frequent subgraph
Computational Challenges
• Key to FSG’s computational efficiency:
– Uses an efficient algorithm to determine a canonical labeling
of a graph and use these “strings” to perform identity checks
(simple comparison of strings!).
– Uses a sophisticated candidate generation algorithm that
reduces the number of times each candidate is generated.
– Uses an augmented TID-list based approach to speedup
frequency counting.
FSG: Canonical representation for graphs (based on adjacency matrix)
[Figure: a labeled graph G with vertex labels a, a, a, b and edge labels x, y, z, together with two of its adjacency matrices M1 and M2; flattening each matrix yields a code string, e.g. Code(M1) = "abyzx" and Code(M2) = "abaxyz"]

Code(G) = min{ code(M) | M is an adjacency matrix of G }
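
A brute-force sketch of Code(G): minimize the flattened string over all vertex orderings. The exact string layout in the slides' figure is not fully recoverable, so the flattening below (vertex labels first, then the upper triangle of the matrix with '0' for non-edges) is our assumption:

```python
from itertools import permutations

def canonical_code(labels, edges):
    """Brute-force Code(G) = min over all vertex orderings (exponential --
    FSG replaces this with the invariant-based heuristics on the next slide)."""
    n = len(labels)
    best = None
    for perm in permutations(range(n)):
        # vertex labels, then the upper triangle read row by row
        code = "".join(labels[v] for v in perm)
        for i in range(n):
            for j in range(i + 1, n):
                code += edges.get(frozenset((perm[i], perm[j])), "0")
        best = code if best is None else min(best, code)
    return best

# triangle with vertex labels a, a, b and edge labels x, y, z
labels = ["a", "a", "b"]
edges = {frozenset((0, 1)): "x", frozenset((0, 2)): "y", frozenset((1, 2)): "z"}
print(canonical_code(labels, edges))  # -> 'aabxyz'
```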
Canonical labeling
[Figure: a graph on vertices v0–v5, four labeled B and two labeled A, shown with two different vertex orderings of its adjacency matrix; reading each ordering row by row gives Label = "1 01 011 0001 00010" versus Label = "1 11 100 1000 01000" — the smaller string is the canonical label]
FSG: Finding the canonical labeling

– The problem is as complex as graph isomorphism,
but FSG suggests some heuristics to speed it up
such as:
• Vertex Invariants (e.g. degree)
• Neighbor lists
• Iterative partitioning
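
A sketch of the vertex-invariant idea: group vertices by cheap invariants such as (label, degree), so that candidate canonical orderings only permute vertices within the same cell (the partition below and its names are our illustration):

```python
from collections import defaultdict

def invariant_partition(labels, adj):
    """Partition vertices by (label, degree) before trying permutations:
    canonical orderings only permute vertices inside the same cell."""
    cells = defaultdict(list)
    for v, lab in enumerate(labels):
        cells[(lab, len(adj[v]))].append(v)
    return sorted(cells.items())

labels = ["B", "B", "B", "B", "A", "A"]
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1, 5], 4: [2], 5: [3]}
print(invariant_partition(labels, adj))
# cells of sizes 2, 2, 2: only 2!*2!*2! = 8 within-cell orderings
# remain, instead of 6! = 720 overall
```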
Another FSG heuristic: frequency counting
Transactions containing the two frequent (k-1)-subgraphs g1 and g2:
– g1, g2 ⊆ T1; g1 ⊆ T2; g1, g2 ⊆ T3; g2 ⊆ T6; g1 ⊆ T8; g1, g2 ⊆ T9
TID lists:
– TID(g1) = { 1, 2, 3, 8, 9 }
– TID(g2) = { 1, 3, 6, 9 }
Candidate ck = join(g1, g2):
– TID(ck) ⊆ TID(g1) ∩ TID(g2) = { 1, 3, 9 }
• Perform subgraph isomorphism only against T1, T3 and T9 to determine TID(ck).
• Note: TID lists require a lot of memory.
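
The heuristic is just a set intersection; a two-line sketch mirroring the slide's numbers:

```python
def candidate_tids(tid1, tid2):
    """Upper bound on where a joined candidate can occur: only transactions
    containing both parent patterns need a subgraph-isomorphism test."""
    return sorted(set(tid1) & set(tid2))

print(candidate_tids([1, 2, 3, 8, 9], [1, 3, 6, 9]))  # [1, 3, 9]
```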
FSG performance on the DTP dataset (chemical compounds)
[Figure: running time in seconds (up to ~1,600) and number of patterns discovered (up to ~10,000) as a function of the minimum support (1%–10%); both grow sharply as the support threshold decreases]
Topology Is Not Enough (Sometimes)
[Figure: two chemical compounds built from the same atoms with the same topology but different spatial arrangements]
• Graphs arising from physical domains have a strong geometric nature.
– This geometry must be taken into account by the data-mining algorithms.
• Geometric graphs:
– Vertices have physical 2D and 3D coordinates associated with them.
gFSG — Geometric Extension of FSG (Kuramochi & Karypis, ICDM 2002)
• Same input and same output as FSG
– Finds frequent geometric connected subgraphs
• Geometric version of (sub)graph isomorphism
– The mapping of vertices can be translation, rotation, and/or scaling invariant.
– The matching of coordinates can be inexact, as long as they are within a tolerance radius r.
– r-tolerant geometric isomorphism.
[Figure: three geometric graphs A, B, C that are topologically identical but geometrically distinct]
Different Approaches for GM
• Apriori Approach
– FSG
– Path Based

• DFS Approach
– gSpan

• Greedy Approach
– Subdue

X. Yan and J. Han
gSpan: Graph-Based
Substructure Pattern Mining
ICDM, 2002.
gSpan outline
Part 1: define the Tree Search Space (TSS).
Part 2: find all frequent graphs by exploring the TSS.
Motivation: DFS exploration w.r.t. itemsets
Itemset search space – prefix based
[Figure: the prefix-based enumeration tree over the items a–e, from the single items a, b, c, d, e down to abcde; each itemset hangs under its (k-1)-prefix]
Motivation for TSS
• A canonical representation of an itemset is obtained by a complete order over the items.
• Each possible itemset appears in the TSS exactly once – no duplications or omissions.
• Properties of the tree search space:
– For each k-label, its parent is the (k-1)-prefix of the given k-label.
– The relation among siblings is ascending lexicographic order.
DFS Code representation
• Map each graph (2-dimensional) to a sequential DFS code (1-dimensional).
• Lexicographically order the codes.
• Construct the TSS based on the lexicographic order.
DFS Code construction
• Given a graph G, for each depth-first search over G, construct the corresponding DFS code. Each code entry (i, j, li, l(i,j), lj) records the DFS discovery indices of an edge's endpoints, their vertex labels, and the edge label.
[Figure: steps (a)–(g) of one DFS over a graph on vertices v0–v4 with vertex labels X, Y, X, Z, Z and edge labels a, b, c, d, yielding the code
(0,1,X,a,Y) (1,2,Y,b,X) (2,0,X,a,X) (2,3,X,c,Z) (3,1,Z,b,Y) (1,4,Y,d,Z)]
Single graph, several DFS-Codes
[Figure: the same graph G traversed by three different DFS orders (a), (b), (c)]

edge  (a)              (b)              (c)
1     (0,1,X,a,Y)      (0,1,Y,a,X)      (0,1,X,a,X)
2     (1,2,Y,b,X)      (1,2,X,a,X)      (1,2,X,a,Y)
3     (2,0,X,a,X)      (2,0,X,b,Y)      (2,0,Y,b,X)
4     (2,3,X,c,Z)      (2,3,X,c,Z)      (2,3,Y,b,Z)
5     (3,1,Z,b,Y)      (3,0,Z,b,Y)      (3,0,Z,c,X)
6     (1,4,Y,d,Z)      (0,4,Y,d,Z)      (2,4,Y,d,Z)
Single graph – single Min DFS-Code!
[Figure: the same three DFS codes of G; exactly one of them — here column (a) — is the minimum in DFS lexicographic order:
(0,1,X,a,Y) (1,2,Y,b,X) (2,0,X,a,X) (2,3,X,c,Z) (3,1,Z,b,Y) (1,4,Y,d,Z)]
Minimum DFS-Code
• The minimum DFS code min(G), in DFS lexicographic
order, is a canonical representation of graph G.
• Graphs A and B are isomorphic if and only if:
min(A) = min(B)
DFS-Code Tree: parent-child relation
• If min(G1) = {a0, a1, …, an}
and min(G2) = {a0, a1, …, an, b}
– G1 is parent of G2
– G2 is child of G1

• A valid DFS code requires that b grows from a vertex
on the rightmost path (inherited property from the
DFS search).
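
The parent–child relation is a one-tuple prefix extension; a minimal sketch using the code from the slides (the rightmost-path condition itself is not checked here):

```python
def is_parent(code_p, code_c):
    """G1 is the parent of G2 in the DFS code tree iff min(G2) extends
    min(G1) by exactly one edge tuple (grown from the rightmost path --
    that extra condition is outside this sketch)."""
    return len(code_c) == len(code_p) + 1 and code_c[:-1] == code_p

parent = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X"),
          (2, 3, "X", "c", "Z"), (3, 1, "Z", "b", "Y")]
child = parent + [(1, 4, "Y", "d", "Z")]
print(is_parent(parent, child))  # True
```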
[Figure: graph G1 on vertices v0–v4 with vertex labels X, Y, X, Z, Z]
Min(G1) = (0,1,X,a,Y) (1,2,Y,b,X) (2,0,X,a,X) (2,3,X,c,Z) (3,1,Z,b,Y) (1,4,Y,d,Z)

A child of graph G1 must grow an edge from the rightmost path of G1 (necessary condition).
[Figure: graph G2 grown from G1 by one edge. A new vertex v5 may be attached only to a vertex on the rightmost path (a forward edge), and the rightmost vertex may connect back to an earlier vertex on the rightmost path (a backward edge); growing an edge anywhere else is wrong]
Search space: DFS code Tree
• Organize DFS code nodes as parent–child.
• Sibling nodes are organized in ascending DFS lexicographic order.
• An in-order traversal follows DFS lexicographic order!
[Figure: the DFS code tree over vertex labels A, B, C; each node is the minimum DFS code of a graph, children extend their parent by one edge, and subtrees rooted at non-minimum DFS codes (duplicates of graphs already enumerated) are pruned]
Tree pruning
• All descendants of an infrequent node are also infrequent.
• All descendants of a non-minimal DFS code are also non-minimal DFS codes.
Part 1: defining the Tree Search Space (TSS).
Part 2: gSpan finds all frequent graphs by exploring the TSS.
gSpan Algorithm
gSpan(D, F, g)
1: if g ≠ min(g)
     return;
2: F ← F ∪ { g }
3: children(g) ← [generate all potential children of g with one-edge growth]*
4: Enumerate(D, g, children(g))
5: for each c ∈ children(g)
     if support(c) ≥ #minSup
       SubgraphMining(D, F, c)
___________________________

* gSpan improves on this line
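
The overall shape — grow depth-first, prune non-canonical duplicates and infrequent subtrees — can be shown runnably with the same "uniquely labeled edges" cheat as the FSG sketch earlier. A pattern is a tuple of edges in a fixed order, and we only extend with larger edges: the itemset analogue of "extend only along the minimum DFS code":

```python
def dfs_mine(transactions, min_count):
    """Pattern-growth skeleton in the spirit of gSpan (graph structure,
    DFS codes, and rightmost extension are all abstracted away)."""
    edges = sorted({e for t in transactions for e in t})
    frequent = []

    def support(pattern):
        return sum(1 for t in transactions if set(pattern) <= t)

    def grow(pattern):
        frequent.append(pattern)
        start = edges.index(pattern[-1]) + 1
        for e in edges[start:]:            # canonical extensions only:
            child = pattern + (e,)         # no pattern is generated twice
            if support(child) >= min_count:  # infrequent subtrees are pruned
                grow(child)

    for e in edges:
        if support((e,)) >= min_count:
            grow((e,))
    return frequent

T = [frozenset({"ab", "bc", "cd"}), frozenset({"ab", "bc"}), frozenset({"cd"})]
print(dfs_mine(T, 2))  # [('ab',), ('ab', 'bc'), ('bc',), ('cd',)]
```

Contrast with the earlier BFS skeleton: the DFS version needs no candidate-join step and no level-wide candidate sets, which is one source of gSpan's speedup over FSG.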
Example
Given: a database D of three graph transactions T1, T2, T3 over labels a, b, c.
Task: mine all frequent subgraphs with support ≥ 2 (#minSup).
[Figure: gSpan walks the DFS code tree over D, annotating each pattern with its TID list — e.g. TID = {1,2,3} for the frequent single edges, TID = {1,3} and TID = {1,2} for larger patterns — and pruning every branch whose TID list drops below size 2]
gSpan Performance
• On synthetic datasets it was 6–10 times faster than FSG.
• On chemical compound datasets it was 15–100 times faster!
• But this was compared against old versions of FSG!
Different Approaches for GM
• Apriori Approach
– FSG
– Path Based

• DFS Approach
– gSpan

• Greedy Approach
– Subdue

D. J. Cook and L. B. Holder
Graph-Based Data Mining
Tech. report, Department of CS
Engineering, 1998
Graph Pattern Explosion Problem
• If a graph is frequent, all of its subgraphs are frequent – the Apriori property.
• An n-edge frequent graph may have 2^n subgraphs.
• Among 422 chemical compounds confirmed to be active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns if the minimum support is 5%.
Subdue algorithm
• A greedy algorithm for finding some of the most prevalent subgraphs.
• The method is not complete, i.e. it may not find all frequent subgraphs, but in exchange it is fast.
Subdue algorithm (Cont.)
• It discovers substructures that compress the original
data and represent structural concepts in the data.
• Based on Beam Search - like BFS it progresses level by
level. Unlike BFS, however, beam search moves
downward only through the best W nodes at each
level. The other nodes are ignored.
Subdue algorithm: step 1
Step 1: Create a substructure for each unique vertex label.
[Figure: a DB of shapes (triangles, squares, a circle, a rectangle) connected by "on" and "left" edges]
Substructures: triangle (4), square (4), circle (1), rectangle (1)
Subdue Algorithm: step 2
Step 2: Expand the best substructure by an edge, or by an edge and a neighboring vertex.
[Figure: the same DB and the candidate expansions, e.g. triangle–on–square, square–left–square, square–on–rectangle, rectangle–on–triangle]
Subdue Algorithm: steps 3-5

Step 3: Keep only best substructures on queue (specified by
beam width).
Step 4: Terminate when queue is empty or when the number
of discovered substructures is greater than or equal to
the limit specified.
Step 5: Compress graph and repeat to generate hierarchical
description.
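
The control flow of steps 1–4 is a standard beam search; here is a generic, runnable sketch (the toy `expand`/`score` stand in for Subdue's one-edge expansion and compression-based evaluation, which are not shown on the slides):

```python
def beam_search(initial, expand, score, beam_width, limit):
    """Beam-search loop as used by Subdue: keep only the best beam_width
    substructures per level; stop when the frontier empties or enough
    substructures have been discovered."""
    discovered = []
    frontier = sorted(initial, key=score, reverse=True)[:beam_width]
    while frontier and len(discovered) < limit:
        discovered.extend(frontier)
        children = [c for s in frontier for c in expand(s)]
        frontier = sorted(children, key=score, reverse=True)[:beam_width]
    return sorted(discovered, key=score, reverse=True)[:limit]

# toy usage: "substructures" are strings, expansion appends a letter,
# the score rewards occurrences of 'a'
print(beam_search(["a", "b"],
                  lambda s: [s + "a", s + "b"],
                  lambda s: s.count("a"),
                  beam_width=2, limit=5))
```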
Outline
• Introduction
• Motivation and applications for Graph mining
• Mining Frequent Subgraphs – Transaction setting
– BFS/Apriori Approach (FSG and others)
– DFS Approach (gSpan and others)
– Greedy Approach

• Mining Frequent Subgraphs – Single graph setting
– The support issue
– Path mining algorithm
– Constraint-based mining
Single Graph Setting
Most existing algorithms use a transaction-setting approach.
That is, even if a pattern appears in a transaction multiple times, it is counted once (FSG, gSpan).
What if the entire database is a single graph?
This is called the single graph setting.
We need a different support definition!
Single graph setting – Motivation
• Often the input is a single large graph. Examples:
– The web, or portions of it.
– A social network (e.g. a network of users communicating by email at BGU).
– A large XML database such as DBLP or a movies database.
• Mining large graph databases is very useful.
Support issue
A support measure is admissible if, for any pattern P and any sub-pattern Q ⊂ P, the support of P is not larger than the support of Q.
Problem: the raw number of pattern appearances is not an admissible measure!
Support issue
An instance graph of a pattern P in a database graph D is a graph whose nodes are the instances of P in D, with two nodes connected by an edge when the corresponding instances share an edge of D.
Support issue
Operations on the instance graph:
• Clique contraction: replace a clique C by a single node c; only the nodes adjacent to each node of C may be adjacent to c.
• Node expansion: replace a node v by a new subgraph whose nodes may or may not be adjacent to the nodes adjacent to v.
• Node addition: add a new node to the graph and arbitrary edges between the new node and the old ones.
• Edge removal: remove an edge.
The main result
Theorem. A support measure S is admissible if and only if it is non-decreasing on the instance graph of every pattern P under clique contraction, node expansion, node addition, and edge removal.
Example of a support measure – MIS

MIS = (maximum independent set size of the instance graph) / (number of edges in the database graph)
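
The numerator is an NP-hard quantity; for small instance graphs it can be computed by brute force, as in this sketch (the example instance graph is ours):

```python
from itertools import combinations

def mis_size(nodes, edges):
    """Brute-force maximum independent set of a (small) instance graph;
    the MIS-based support divides this by the number of edges in the
    database graph. NP-hard in general -- this is exponential."""
    nodes = list(nodes)
    for k in range(len(nodes), 0, -1):
        for subset in combinations(nodes, k):
            if not any(frozenset(pair) in edges
                       for pair in combinations(subset, 2)):
                return k
    return 0

# instance graph of some pattern: instances 0..3, overlaps as edges
edges = {frozenset((0, 1)), frozenset((1, 2))}
print(mis_size(range(4), edges))  # 3, e.g. the instances {0, 2, 3}
```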
Path mining algorithm (Vanetik, Gudes, Shimony)
 Goal: find all frequent connected subgraphs
of a database graph.
 Basic approach: Apriori or BFS.
 The basic building block is a path not an edge.
 This works since any graph can be decomposed into
edge-disjoint paths.
 Result: faster convergence of the algorithm.
Path-based mining algorithm
• The algorithm uses paths as the basic building blocks for pattern construction.
• It starts with one-path graphs and combines them into 2-path, 3-path, etc. graphs.
• The combination technique does not use graph operations and is easy to implement.
• The path number of a graph is computed in linear time: it is the number of odd-degree vertices divided by two (see the sketch after this list).
• Given a minimal path cover P, removal of one path creates a graph with a minimal path cover of size |P|−1.
• There exist at least two paths in P whose removal leaves the graph connected.
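
A direct sketch of the path-number computation for a connected graph, per the slide's rule; guarding the Eulerian case (zero odd-degree vertices, where one path/circuit suffices) is our addition:

```python
def path_number(adj):
    """Minimal path-cover size of a connected graph: half the number of
    odd-degree vertices (we return 1 if the graph is Eulerian)."""
    odd = sum(1 for v in adj if len(adj[v]) % 2 == 1)
    return max(odd // 2, 1)

# small example: only v2 and v5 have odd degree, so one path suffices
adj = {1: [2, 5], 2: [1, 3, 5], 3: [2, 4], 4: [3, 5], 5: [4, 1, 2]}
print(path_number(adj))  # 1
```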
More than one path cover for graph
1. Define a descriptor of each path based on node labels and node
degrees.
2. Use lexicographical order among descriptors to compare
between paths.
3. One graph can have several minimal path covers.
4. We only use path covers that are minimal w.r.t. lexicographical
order.
5. Removal of a path from a lexicographically minimal path cover leaves the cover lexicographically minimal.
Example: path descriptors
P1 = v1,v2,v3,v4,v5

P2 = v1,v5,v2

Desc (P1) = <A,0,1>,<A,1,1>,<B,1,0>,<C,1,1>,<D,1,1>
Desc (P2) = <A,0,1>,<A,1,0>,<B,1,1>
Path mining algorithm
• Phase 1: find all frequent 1-path graphs.
• Phase 2: find all frequent 2-path graphs by "joining" frequent 1-path graphs.
• Phase 3: find all frequent k-path graphs, k ≥ 3, by "joining" pairs of frequent (k-1)-path graphs.
Main challenge: the "join" must ensure soundness and completeness of the algorithm.
Graph as a collection of paths: table representation
A graph composed of 3 paths:

Node | P1 | P2 | P3
v1   | a1 | ⊥  | ⊥
v2   | a2 | b2 | ⊥
v3   | a3 | ⊥  | ⊥
v4   | ⊥  | b1 | ⊥
v5   | ⊥  | b3 | c3
v6   | ⊥  | ⊥  | c1
v7   | ⊥  | ⊥  | c2
Removing a path from the table (delete P3):

Node | P1 | P2 | P3          Node | P1 | P2
v1   | a1 | ⊥  | ⊥           v1   | a1 | ⊥
v2   | a2 | b2 | ⊥           v2   | a2 | b2
v3   | a3 | ⊥  | ⊥     →     v3   | a3 | ⊥
v4   | ⊥  | b1 | ⊥           v4   | ⊥  | b1
v5   | ⊥  | b3 | c3          v5   | ⊥  | b3
v6   | ⊥  | ⊥  | c1
v7   | ⊥  | ⊥  | c2
Join graphs with common paths: the sum operation
[Figure: two path tables C1 (over paths P1, P2, P3) and C2 (over paths P1, P2, P4) that agree on the shared paths P1 and P2 are joined on P1, P2; the sum C3 is a table over P1, P2, P3, P4 whose node rows merge the rows of C1 and C2]
The sum operation: how it looks on graphs
The sum is not enough: the splice operation
• We need to construct a frequent n-path graph G on paths P1, …, Pn.
• We have two frequent (n-1)-path graphs: G1 on paths P1, …, Pn-1 and G2 on paths P2, …, Pn.
• The sum of G1 and G2 gives us an n-path graph G' on paths P1, …, Pn.
• G' = G if P1 and Pn have no common node that belongs solely to them.
• A frequent 2-path graph H containing P1 and Pn exactly as they appear in G exists if G is frequent.
• So we join the nodes of P1 and Pn in G' according to H. This is the splice operation!
The splice operation: an example
[Figure: G1 (over paths P2, P3) and G2 (over paths P1, P2) are spliced into G3 (over P1, P2, P3): the sum is computed first, and then nodes of P1 and P3 are identified according to a frequent 2-path graph containing them]
The splice operation: how it looks on graph
Labeled graphs: we mind the labels
We join only nodes that have the same labels!
Path mining algorithm
1. Find all frequent edges.
2. Find frequent paths by adding one edge at a time (not all nodes are suitable for this!).
3. Find all frequent 2-path graphs by exhaustive joining.
4. Set k = 2.
5. While frequent k-path graphs exist:
 a) Perform the sum operation on pairs of frequent k-path graphs where applicable.
 b) Perform the splice operation on the generated (k+1)-path candidates to get additional (k+1)-path candidates.
 c) Compute support for the (k+1)-path candidates.
 d) Eliminate non-frequent candidates and set k := k+1.
 e) Go to 5.
Complexity
Exponential – the number of frequent patterns can be exponential in the size of the database (as for any Apriori algorithm).
Difficult (NP-hard) tasks:
1. Support computation, which consists of:
 a. Finding all instances of a frequent pattern in the database (subgraph isomorphism).
 b. Computing the MIS (maximum independent set size) of an instance graph.
Relatively easy tasks:
1. Candidate set generation: polynomial in the size of the frequent set from the previous iteration.
2. Elimination of isomorphic candidate patterns: graph isomorphism computation is at worst exponential in the size of a pattern, not the database.
Additional Approaches for the Single Graph Setting
• BFS approach
– hSiGram
• DFS approach
– vSiGram
• Both use approximations of the MIS measure.

M. Kuramochi and G. Karypis, Finding Frequent Patterns in a Large Sparse Graph, in Proc. of SIAM 2004.
Conclusions
• The Data Mining field proved its practicality during its short lifetime with effective DM algorithms.
• Many applications in databases, chemistry & biology, networks, etc.
• Both the transaction and single-graph settings are important.
• Graph Mining is:
– dealing with designing effective algorithms for mining graph datasets;
– facing many hardness problems on the way;
– a fast-growing field with many as-yet-unseen possibilities.
• As more and more information is stored in complicated structures, we need to develop a new set of algorithms for Graph Data Mining.
Some References
[1] A. Inokuchi, T. Washio and H. Motoda, An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data, Proceedings of the 4th PKDD'00, 2000, pages 13-23.
[2] M. Kuramochi and G. Karypis, An Efficient Algorithm for Discovering Frequent Subgraphs, Tech. report, Department of Computer Science/Army HPC Research Center, 2002.
[3] N. Vanetik, E. Gudes, and S. E. Shimony, Computing Frequent Graph Patterns from Semistructured Data, Proceedings of the 2002 IEEE ICDM'02.
[4] X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining, Tech. report, University of Illinois at Urbana-Champaign, 2002.
[5] J. Huan, W. Wang and J. Prins, Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism, Proceedings of the 3rd IEEE ICDM'03, p. 549.
[6] Moti Cohen, Ehud Gudes, Diagonally Subgraphs Pattern Mining, DMKD 2004, pages 51-58, 2004.
[7] D. J. Cook and L. B. Holder, Graph-Based Data Mining, Tech. report, Department of CS Engineering, 1998.
Documents Classification:
Alternative Representation of Multilingual Web Documents: The Graph-Based Model
Introduced in A. Schenker, H. Bunke, M. Last, A. Kandel, Graph-Theoretic Techniques for Web Content Mining, World Scientific, 2005
The Graph-Based Model of Web Documents
• Basic ideas:
– One node for each unique term.
– If word B follows word A, there is an edge from A to B.
– In the presence of terminating punctuation marks (periods, question marks, and exclamation points) no edge is created between two words.
– Graph size is limited by including only the most frequent terms.
– Several variations for node and edge labeling (see the next slides).
• Pre-processing steps:
– Stop words are removed.
– Lemmatization: alternate forms of the same term (singular/plural, past/present/future tense, etc.) are conflated to the most frequently occurring form.
A sketch of this construction follows below.
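
A minimal sketch of the construction (stop-word removal and lemmatization omitted; linking only terms that are directly adjacent after filtering is our simplification — variants of the model differ here):

```python
import re

def doc_graph(text, max_terms=10):
    """One node per unique term, an edge A -> B when B immediately follows
    A, no edge across sentence-ending punctuation; only the max_terms most
    frequent terms are kept."""
    sentences = re.split(r"[.?!]", text.lower())
    words = [re.findall(r"[a-z]+", s) for s in sentences]
    freq = {}
    for s in words:
        for w in s:
            freq[w] = freq.get(w, 0) + 1
    keep = set(sorted(freq, key=freq.get, reverse=True)[:max_terms])
    edges = set()
    for s in words:
        for a, b in zip(s, s[1:]):
            if a in keep and b in keep:
                edges.add((a, b))  # no edge is ever created across "."
    return keep, edges

print(doc_graph("Car bomb exploded. Bomb wounded the driver."))
```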
The Standard Representation
• Edges are labeled according to the document section where
the words are followed by each other
– Title (TI) contains the text related to the document’s title and any
provided keywords (meta-data);
– Link (L) is the “anchor text” that appears in clickable hyper-links on the
document;
– Text (TX) comprises any of the visible text in the document (this
includes anchor text but not title and keyword text)

TI
YAHOO

L
NEWS

TX

TX

REPORTS

SERVICE
TX

MORE

REUTERS
Graph Based Document Representation – Detailed Example
Source: www.cnn.com, May 24, 2005
Graph Based Document Representation – Parsing
[Figure: the web page is parsed into its title, link, and text sections]
Standard Graph Based Document Representation
The ten most frequent terms are used:

Word          | Frequency
Iraqis        | 3
Killing       | 2
Bomb          | 2
Wounding      | 2
Driver        | 2
Exploded      | 1
Baghdad       | 1
International | 1
CNN           | 1
Car           | 1

[Figure: the resulting graph over CAR, BOMB, EXPLODED, BAGHDAD, KILLING, IRAQIS, WOUNDING, DRIVER, INTERNATIONAL, and CNN, with edges labeled TX (text), L (link), and TI (title)]
Classification using graphs
• Basic idea:
– Mine the frequent sub-graphs, call them terms
– Use TFIDF for assigning the most characteristic terms to
documents
– Use Clustering and K-nearest neighbors classification
Subgraph Extraction
• Input
– G – a training set of directed graphs with unique nodes
– CRmin – minimum classification rate
• Output
– A set of classification-relevant subgraphs
• Process:
– For each class, find the subgraphs with CR > CRmin
– Combine all subgraphs into one set
• Basic assumption
– Classification-relevant subgraphs are more frequent in a specific category than in other categories
Computing the Classification Rate
• Subgraph classification rate:

CR(g'k(ci)) = SCF(g'k(ci)) × ISF(g'k(ci))

• SCF(g'k(ci)) – Subgraph Class Frequency of subgraph g'k in category ci
• ISF(g'k(ci)) – Inverse Subgraph Frequency of subgraph g'k in category ci
• A classification-relevant feature is a feature that best explains a specific category, i.e. is more frequent in this category than in all others.
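
The slides do not spell out SCF and ISF, so here is one plausible TFIDF-style reading as a sketch — both concrete formulas below are assumptions, not the book's definitions:

```python
from math import log

def classification_rate(sub_docs_in_ci, docs_in_ci, cats_containing, n_cats):
    """Assumed reading of CR = SCF x ISF: SCF is the fraction of
    category-ci documents containing the subgraph; ISF is an IDF-style
    penalty on subgraphs that appear in many categories."""
    scf = sub_docs_in_ci / docs_in_ci
    isf = log(n_cats / cats_containing)
    return scf * isf

# a subgraph in 8 of 10 documents of ci, found in 2 of 4 categories
print(classification_rate(8, 10, 2, 4))  # ~= 0.8 * log(2) ~= 0.55
```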
k-Nearest Neighbors with Graphs
[Figure: classification accuracy (70%–86%) vs. the number of nearest neighbors k (1–10), comparing the vector model with cosine and Jaccard similarity against graph representations of 40, 70, 100, and 150 nodes per graph]
Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...Russian Call Girls in Andheri Airport Mumbai WhatsApp  9167673311 💞 Full Nigh...
Russian Call Girls in Andheri Airport Mumbai WhatsApp 9167673311 💞 Full Nigh...
 

Graph Mining Seminar Overview

  • 13. Graph Pattern Mining • Frequent subgraphs – A (sub)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold • Applications of graph pattern mining: – Mining biochemical structures – Program control flow analysis – Mining XML structures or Web communities – Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
  • 14. Example 1. (Figure: a graph dataset of three transactions T1, T2, T3 and the two frequent patterns (1) and (2) found with minimum support 2.)
  • 15. Example 2. (Figure: a second graph dataset and its frequent patterns with minimum support 2.)
  • 16. Graph Mining Algorithms • Simple path patterns (Chen, Park, Yu 98) • Generalized path patterns (Nanopoulos, Manolopoulos 01) • Simple tree patterns (Lin, Liu, Zhang, Zhou 98) • Tree-like patterns (Wang, Huiqing, Liu 98) • General graph patterns (Kuramochi, Karypis 01, Han 02, etc.)
  • 17. Graph mining methods • Apriori-based approach • Pattern-growth approach
  • 20. Outline • Introduction • Motivation and applications of Graph Mining • Mining Frequent Subgraphs – Transaction setting – BFS/Apriori Approach (FSG and others) – DFS Approach (gSpan and others) – Greedy Approach • Mining Frequent Subgraphs – Single graph setting – The support issue – Path mining algorithm – Constraint-based mining
  • 21. Transaction Setting. Input: (D, minSup) — a set of labeled-graph transactions D = {T1, T2, …, TN} and a minimum support minSup. Output: all frequent subgraphs. A subgraph is frequent if it is a subgraph of at least minSup·|D| (or #minSup) different transactions in D. Each frequent subgraph is connected.
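To make the transaction-setting support concrete, here is a minimal sketch assuming networkx graphs whose vertices carry their label under an attribute key "l"; the function name pattern_support and that key are illustrative, not from the slides, and node-induced subgraph isomorphism stands in for full edge-induced containment.

```python
from networkx.algorithms import isomorphism

def pattern_support(pattern, transactions):
    """Count the transactions that contain `pattern` at least once.

    Repeated occurrences inside a single transaction still count as 1,
    which is exactly the transaction-setting support defined above.
    """
    same_label = isomorphism.categorical_node_match("l", None)
    count = 0
    for t in transactions:
        gm = isomorphism.GraphMatcher(t, pattern, node_match=same_label)
        if gm.subgraph_is_isomorphic():   # stops at the first occurrence
            count += 1
    return count
```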
  • 22. Single graph setting. Input: (D, minSup) — a single graph D (e.g. the Web, DBLP, or an XML file) and a minimum support minSup. Output: all frequent subgraphs. A subgraph is frequent if its support in D, computed with an admissible support measure (a measure that satisfies the downward closure property), is at least minSup.
  • 24. Finding Frequent Subgraphs: Input and Output. Input: a database of graph transactions — undirected simple graphs (no loops, no multiple edges), each with labels on its vertices and edges; transactions need not be connected; a minimum support threshold σ. Output: the frequent subgraphs that satisfy the minimum support constraint; each frequent subgraph is connected. (Figure: example transactions and the frequent connected subgraphs with support 100%, 66%, and 66%.)
  • 25. Different Approaches for GM • Apriori Approach – FSG – Path Based • DFS Approach – gSpan • Greedy Approach – Subdue
  • 26. FSG Algorithm [M. Kuramochi and G. Karypis, Frequent subgraph discovery, ICDM 2001]. Notation: a k-subgraph is a subgraph with k edges.
  Init: scan the transactions to find F1 and F2, the sets of all frequent 1-subgraphs and 2-subgraphs, together with their counts.
  For (k = 3; F(k-1) ≠ ∅; k++):
    1. Candidate generation — build Ck, the set of candidate k-subgraphs, from F(k-1), the set of frequent (k-1)-subgraphs.
    2. Candidate pruning — a necessary condition for a candidate to be frequent is that each of its (k-1)-subgraphs is frequent.
    3. Frequency counting — scan the transactions to count the occurrences of the subgraphs in Ck.
    4. Fk = { c ∈ Ck | c has count no less than #minSup }.
  Return F1 ∪ F2 ∪ … ∪ Fk (= F).
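The loop above is easier to follow as running code. The toy sketch below keeps only the Apriori skeleton: a "subgraph" is encoded as a frozenset of labeled edges, so graph isomorphism and connectivity are deliberately sidestepped, and all names are illustrative rather than FSG's own.

```python
from itertools import combinations

def fsg_like(transactions, min_sup):
    """Apriori-style loop; a 'graph' here is just a frozenset of labeled edges."""
    def support(cand):
        return sum(1 for t in transactions if cand <= t)

    edges = {e for t in transactions for e in t}
    fk = {frozenset([e]) for e in edges if support(frozenset([e])) >= min_sup}
    frequent = set(fk)                      # F1
    while fk:
        # candidate generation: join two k-edge sets sharing a (k-1)-edge core
        ck = {a | b for a, b in combinations(fk, 2) if len(a | b) == len(a) + 1}
        # candidate pruning: every one-edge-smaller subset must be frequent
        ck = {c for c in ck if all(c - {e} in fk for e in c)}
        # frequency counting over the transactions
        fk = {c for c in ck if support(c) >= min_sup}
        frequent |= fk
    return frequent

T = [frozenset({("a", "b"), ("b", "c")}),
     frozenset({("a", "b"), ("b", "c"), ("c", "d")}),
     frozenset({("a", "b"), ("c", "d")})]
print(fsg_like(T, min_sup=2))   # all frequent edge sets
```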
  • 27. Trivial operations are complicated with graphs • Candidate generation – To determine two candidates for joining, we need to check for graph isomorphism. • Candidate pruning – To check downward closure property, we need graph isomorphism. • Frequency counting – Subgraph isomorphism for checking containment of a frequent subgraph.
  • 28. Candidate generation (join) based on core detection: two frequent (k-1)-subgraphs are joined only if they share a common (k-2)-subgraph (the core); each parent contributes its one non-core edge to the resulting k-edge candidate. (Figure: three join examples.)
  • 29. Candidate generation based on core detection (cont.): two (k-1)-subgraphs may share multiple cores, so a single pair can give rise to several distinct candidates. (Figure: a pair of (k-1)-subgraphs with a first and a second core.)
  • 30. Candidate pruning: the downward closure property. Every (k-1)-subgraph of a k-candidate must be frequent; for each (k-1)-subgraph of a given k-candidate, check whether it appears among the frequent (k-1)-subgraphs. (Figure: 3-candidates and 4-candidates.)
  • 32. Computational Challenges Simple operations become complicated & expensive when dealing with graphs… • Candidate generation – To determine if we can join two candidates, we need to perform subgraph isomorphism to determine if they have a common subgraph. – There is no obvious way to reduce the number of times that we generate the same subgraph. – Need to perform graph isomorphism for redundancy checks. – The joining of two frequent subgraphs can lead to multiple candidate subgraphs. • Candidate pruning – To check downward closure property, we need subgraph isomorphism. • Frequency counting – Subgraph isomorphism for checking containment of a frequent subgraph
  • 33. Computational Challenges • Key to FSG’s computational efficiency: – Uses an efficient algorithm to determine a canonical labeling of a graph and use these “strings” to perform identity checks (simple comparison of strings!). – Uses a sophisticated candidate generation algorithm that reduces the number of times each candidate is generated. – Uses an augmented TID-list based approach to speedup frequency counting.
  • 34. FSG: canonical representation for graphs, based on the adjacency matrix. Each vertex ordering of a graph G yields an adjacency matrix M and a code string Code(M) obtained by reading the matrix entries; different orderings give different codes (e.g. Code(M1) ≠ Code(M2)), and the canonical code is Code(G) = min{ Code(M) | M is an adjacency matrix of G }. (Figure: a graph with vertex labels a, a, b and edge labels x, y, z, and two of its adjacency matrices.)
  • 35. Canonical labeling example: the same graph (vertices labeled A and B) under two different vertex orderings produces two different codes, "1 01 011 0001 00010" and "1 11 100 1000 01000"; the lexicographically smallest code over all orderings is the canonical label.
  • 36. FSG: Finding the canonical labeling – The problem is as complex as graph isomorphism, but FSG suggests some heuristics to speed it up such as: • Vertex Invariants (e.g. degree) • Neighbor lists • Iterative partitioning
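For small graphs the canonical code can simply be brute-forced, which makes clear what the heuristics above are avoiding: the sketch below (illustrative names, not FSG's actual encoding) tries every vertex permutation and keeps the lexicographically smallest (labels, adjacency-bits) pair.

```python
from itertools import permutations

def canonical_code(labels, edges):
    """labels: list of vertex labels; edges: iterable of undirected (u, v) pairs."""
    n = len(labels)
    eset = {frozenset(e) for e in edges}
    best = None
    for perm in permutations(range(n)):    # perm[i] = original vertex at position i
        lab = tuple(labels[v] for v in perm)
        # lower triangle of the permuted adjacency matrix, read row by row
        bits = tuple(1 if frozenset((perm[i], perm[j])) in eset else 0
                     for i in range(n) for j in range(i))
        if best is None or (lab, bits) < best:
            best = (lab, bits)
    return best

# two labelings of the same labeled triangle get identical canonical codes
print(canonical_code(["A", "B", "B"], [(0, 1), (1, 2), (0, 2)]) ==
      canonical_code(["B", "A", "B"], [(0, 1), (1, 2), (2, 0)]))   # True
```

Vertex invariants shrink exactly this permutation space: only orderings consistent with, e.g., the degree sequence need to be tried.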
  • 37. Another FSG heuristic: frequency counting with TID lists. Suppose g1 and g2 are frequent (k-1)-subgraphs with TID(g1) = {1, 2, 3, 8, 9} (it occurs in transactions T1, T2, T3, T8, T9) and TID(g2) = {1, 3, 6, 9}. For the candidate ck = join(g1, g2), TID(ck) ⊆ TID(g1) ∩ TID(g2) = {1, 3, 9}, so subgraph isomorphism is performed only against T1, T3 and T9 to determine TID(ck). Note: TID lists require a lot of memory.
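In code, the shortcut is a single set intersection before the expensive test; occurs_in below is a hypothetical stand-in for the subgraph-isomorphism check.

```python
tid_g1 = {1, 2, 3, 8, 9}
tid_g2 = {1, 3, 6, 9}
print(sorted(tid_g1 & tid_g2))   # [1, 3, 9] -- only these need checking

def tid_of_candidate(candidate, possible_tids, transactions, occurs_in):
    """Run the expensive `occurs_in` test only on transactions that
    survived the TID-list intersection of the candidate's parents."""
    return {t for t in possible_tids if occurs_in(candidate, transactions[t])}
```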
  • 38. FSG performance on the DTP dataset (chemical compounds). (Figure: running time [sec] and number of patterns discovered as a function of minimum support [%], for supports from 1% to 10%; both grow steeply as the support threshold drops.)
  • 39. Topology Is Not Enough (Sometimes). Graphs arising from physical domains have a strong geometric nature, and this geometry must be taken into account by the data-mining algorithms. Geometric graphs: vertices have physical 2D or 3D coordinates associated with them. (Figure: the 2D structure of a chemical compound.)
  • 40. gFSG — Geometric Extension of FSG (Kuramochi & Karypis, ICDM 2002). Same input and same output as FSG: finds frequent geometric connected subgraphs. Uses a geometric version of (sub)graph isomorphism: the mapping of vertices can be translation, rotation, and/or scaling invariant, and the matching of coordinates can be inexact as long as they are within a tolerance radius of r — an r-tolerant geometric isomorphism.
  • 41. Different Approaches for GM • Apriori Approach – FSG – Path Based • DFS Approach – gSpan • Greedy Approach – Subdue. X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining, ICDM, 2002.
  • 42. gSpan outline. Part 1: define the Tree Search Space (TSS). Part 2: find all frequent graphs by exploring the TSS.
  • 43. Motivation: DFS exploration w.r.t. itemsets. (Figure: the prefix-based itemset search space over items a–e — a tree containing every subset of {a, b, c, d, e}, from the single items a, …, e down to abcde, with each itemset reached through its prefix.)
  • 44. Motivation for TSS. A canonical representation of an itemset is obtained by a complete order over the items. Each possible itemset appears in the TSS exactly once — no duplications or omissions. Properties of the tree search space: for each k-label, its parent is the (k-1)-prefix of the given k-label, and siblings are in ascending lexicographic order.
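The "exactly once" property is easy to verify in code: a DFS that extends each prefix only with strictly larger items visits every itemset once, which is the behaviour the TSS generalizes from itemsets to DFS codes. A minimal sketch:

```python
def enumerate_itemsets(items, prefix=()):
    """Yield every subset of `items` exactly once, in prefix (DFS) order."""
    for i, x in enumerate(items):
        node = prefix + (x,)
        yield node
        # children may only append items larger than x -- no duplicates
        yield from enumerate_itemsets(items[i + 1:], node)

print(["".join(s) for s in enumerate_itemsets(list("abc"))])
# ['a', 'ab', 'abc', 'ac', 'b', 'bc', 'c']
```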
  • 45. DFS code representation: map each graph (2-dimensional) to a sequential DFS code (1-dimensional), lexicographically order the codes, and construct the TSS based on this lexicographic order.
  • 46. DFS code construction. Given a graph G, each depth-first search over G yields a corresponding DFS code: every edge becomes a 5-tuple (i, j, l_i, l_(i,j), l_j) of the DFS discovery indices of its endpoints, their vertex labels, and the edge label. (Figure: a DFS traversal of a graph with vertex labels X, Y, Z and edge labels a, b, c, d producing the code (0,1,X,a,Y), (1,2,Y,b,X), (2,0,X,a,X), (2,3,X,c,Z), (3,1,Z,b,Y), (1,4,Y,d,Z).)
  • 47. Single graph, several DFS codes. Three different DFS traversals (a), (b), (c) of the same graph G give three different codes:
      edge  (a)              (b)              (c)
      1     (0,1,X,a,Y)      (0,1,Y,a,X)      (0,1,X,a,X)
      2     (1,2,Y,b,X)      (1,2,X,a,X)      (1,2,X,a,Y)
      3     (2,0,X,a,X)      (2,0,X,b,Y)      (2,0,Y,b,X)
      4     (2,3,X,c,Z)      (2,3,X,c,Z)      (2,3,Y,b,Z)
      5     (3,1,Z,b,Y)      (3,0,Z,b,Y)      (3,0,Z,c,X)
      6     (1,4,Y,d,Z)      (0,4,Y,d,Z)      (2,4,Y,d,Z)
  • 48. Single graph — a single minimum DFS code! Among all DFS codes of G, exactly one is smallest in DFS lexicographic order; in the table above, code (a) is the minimum DFS code of G.
  • 49. Minimum DFS-Code • The minimum DFS code min(G), in DFS lexicographic order, is a canonical representation of graph G. • Graphs A and B are isomorphic if and only if: min(A) = min(B)
  • 50. DFS-code tree: parent-child relation. If min(G1) = {a0, a1, …, an} and min(G2) = {a0, a1, …, an, b}, then G1 is the parent of G2 and G2 is a child of G1. A valid DFS code requires that b grows from a vertex on the rightmost path (a property inherited from the DFS search).
  • 51. Example: graph G1 with min(G1) = (0,1,X,a,Y), (1,2,Y,b,X), (2,0,X,a,X), (2,3,X,c,Z), (3,1,Z,b,Y), (1,4,Y,d,Z). A child G2 of G1 must grow its new edge from the rightmost path of G1 (a necessary condition): either a forward edge from a vertex on the rightmost path to a new vertex v5, or a backward edge from the rightmost vertex back to a vertex on the rightmost path; an edge grown from anywhere else is wrong.
  • 52. Search space: DFS code Tree • Organize DFS Code nodes as parent-child. • Sibling nodes organized in ascending DFS lexicographic order. • InOrder traversal follows DFS lexicographic order!
  • 53. (Figure: a DFS-code tree over graphs with vertex labels A, B, C — the root's children are the one-edge graphs, each node's children are its one-edge extensions, and subtrees rooted at non-minimum DFS codes are pruned.)
  • 54. Tree pruning. All descendants of an infrequent node are also infrequent. All descendants of a non-minimal DFS code are also non-minimal DFS codes.
  • 55. Part 1: defining the Tree Search Space (TSS). Part 2: gSpan finds all frequent graphs by exploring the TSS.
  • 56. gSpan Algorithm.
  gSpan(D, F, g):
    1: if g ≠ min(g) return
    2: F ← F ∪ {g}
    3: children(g) ← generate all potential children of g with one-edge growth*
    4: Enumerate(D, g, children(g))
    5: for each c ∈ children(g): if support(c) ≥ #minSup, gSpan(D, F, c)
  * gSpan improves this line.
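Read as code, the recursion is short. The sketch below fixes only the control flow; min_dfs_code, rightmost_extensions and support are hypothetical stand-ins for gSpan's real machinery (the Enumerate step corresponds to computing the extensions and their supports in one scan of D).

```python
def gspan(D, g, F, min_sup, min_dfs_code, rightmost_extensions, support):
    """DFS over the DFS-code tree, pruning non-minimal codes and infrequent nodes."""
    if g != min_dfs_code(g):      # line 1: this subtree is covered elsewhere
        return
    F.append(g)                   # line 2: g is frequent and canonical
    for child in rightmost_extensions(g, D):      # lines 3-4: one-edge growths
        if support(child, D) >= min_sup:          # line 5
            gspan(D, child, F, min_sup, min_dfs_code, rightmost_extensions, support)
```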
  • 57. Example. Given a database D of three graph transactions T1, T2, T3 whose vertices are labeled a, b, c, the task is to mine all frequent subgraphs with support ≥ 2 (#minSup).
  • 58. (Figure: the first level of the search on the example — frequent one-edge graphs such as A–A with TID = {1,2,3}, A–B with TID = {1,2,3}, and A–C with TID = {1,3}.)
  • 59. (Figure: the second level — two-edge extensions along the rightmost path, e.g. a pattern with TID = {1,2} and others with TID = {1,2,3}.)
  • 60. (Figure: the full DFS-code tree explored on the example, with each node extended along its rightmost path until no extension remains frequent.)
  • 61. gSpan performance. On synthetic datasets it was 6–10 times faster than FSG; on chemical-compound datasets it was 15–100 times faster! But this was compared to OLD versions of FSG.
  • 62. Different Approaches for GM • Apriori Approach – FSG – Path Based • DFS Approach – gSpan • Greedy Approach – Subdue. D. J. Cook and L. B. Holder, Graph-Based Data Mining, Tech. report, Department of CS Engineering, 1998.
  • 63. Graph Pattern Explosion Problem. If a graph is frequent, all of its subgraphs are frequent — the Apriori property. An n-edge frequent graph may have 2^n subgraphs. Among 422 chemical compounds confirmed active in an AIDS antiviral screen dataset, there are 1,000,000 frequent graph patterns when the minimum support is 5%.
  • 64. Subdue algorithm: a greedy algorithm for finding some of the most prevalent subgraphs. The method is not complete — it may not find all frequent subgraphs — but in exchange it executes fast.
  • 65. Subdue algorithm (Cont.) • It discovers substructures that compress the original data and represent structural concepts in the data. • Based on Beam Search - like BFS it progresses level by level. Unlike BFS, however, beam search moves downward only through the best W nodes at each level. The other nodes are ignored.
  • 66. Subdue algorithm, step 1: create a substructure for each unique vertex label. (Figure: an example database of shapes connected by "on" and "left" edges; the initial substructures are triangle (4 instances), square (4), circle (1), and rectangle (1).)
  • 67. Subdue algorithm, step 2: expand the best substructures by an edge, or by an edge plus a neighboring vertex. (Figure: candidate expansions such as "triangle on square", "square left of square", and "circle on rectangle".)
  • 68. Subdue Algorithm: steps 3-5 Step 3: Keep only best substructures on queue (specified by beam width). Step 4: Terminate when queue is empty or when the number of discovered substructures is greater than or equal to the limit specified. Step 5: Compress graph and repeat to generate hierarchical description.
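A generic beam-search skeleton in the spirit of steps 2–4 might look as follows; expand and value are hypothetical stand-ins for Subdue's one-edge expansion and its compression-based evaluation, and step 5 would wrap this search in an outer compress-and-repeat loop.

```python
import heapq

def beam_search(seeds, expand, value, beam_width, limit):
    """Level-by-level search keeping only the best `beam_width` nodes per level."""
    discovered = []
    frontier = list(seeds)
    while frontier and len(discovered) < limit:
        frontier = heapq.nlargest(beam_width, frontier, key=value)       # step 3
        discovered.extend(frontier)
        frontier = [child for sub in frontier for child in expand(sub)]  # step 2
    return heapq.nlargest(limit, discovered, key=value)                  # step 4
```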
  • 69. Outline • Introduction • Motivation and applications for Graph mining • Mining Frequent Subgraphs – Transaction setting – BFS/Apriori Approach (FSG and others) – DFS Approach (gSpan and others) – Greedy Approach • Mining Frequent Subgraphs – Single graph setting – The support issue – Path mining algorithm – Constraint-based mining
  • 70. Single Graph Setting. Most existing algorithms use a transaction-setting approach: if a pattern appears in a transaction — even multiple times — it is counted as 1 (FSG, gSpan). What if the entire database is a single graph? This is called the single graph setting, and it needs a different support definition!
  • 71. Single graph setting — motivation. Often the input is a single large graph. Examples: the Web or portions of it; a social network (e.g. a network of users communicating by email at BGU); a large XML database such as DBLP or a movies database. Mining large graph databases is very useful.
  • 72. Support issue. A support measure is admissible if for any pattern P and any sub-pattern Q ⊂ P, the support of P is not larger than the support of Q. Problem: the raw number of pattern appearances is not admissible — overlapping occurrences can let a larger pattern appear more times than its sub-pattern.
  • 73. Support issue. An instance graph of a pattern P in a database graph D is a graph whose nodes are the instances of P in D, with two nodes connected by an edge when the corresponding instances share an edge of D.
  • 74. Support issue. Operations on an instance graph: clique contraction — replace a clique C by a single node c; only the nodes adjacent to every node of C may be adjacent to c. Node expansion — replace a node v by a new subgraph whose nodes may or may not be adjacent to the nodes adjacent to v. Node addition — add a new node to the graph and arbitrary edges between the new node and the old ones. Edge removal — remove an edge.
  • 75. The main result. Theorem: a support measure S is admissible if and only if it is non-decreasing on the instance graph of every pattern P under clique contraction, node expansion, node addition and edge removal.
  • 76. Example of an admissible support measure — MIS: support(P) = (maximum independent set size of the instance graph of P) / (number of edges in the database graph).
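Computing a maximum independent set is itself NP-hard; for a tiny instance graph it can be brute-forced, as in the sketch below (illustrative encoding: instances are numbered and overlap_edges lists the pairs that share a database edge). Practical single-graph miners approximate this quantity.

```python
from itertools import combinations

def mis_size(n_instances, overlap_edges):
    """Size of a maximum independent set, by decreasing-size exhaustive search."""
    overlaps = {frozenset(e) for e in overlap_edges}
    for k in range(n_instances, 0, -1):
        for subset in combinations(range(n_instances), k):
            if all(frozenset(p) not in overlaps for p in combinations(subset, 2)):
                return k
    return 0

# three instances where 0 and 1 overlap: MIS = {0 (or 1), 2}, size 2
print(mis_size(3, [(0, 1)]))   # 2
```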
  • 77. Path mining algorithm (Vanetik, Gudes, Shimony). Goal: find all frequent connected subgraphs of a database graph. Basic approach: Apriori/BFS, but the basic building block is a path, not an edge — this works since any graph can be decomposed into edge-disjoint paths. Result: faster convergence of the algorithm.
  • 78. Path-based mining algorithm. The algorithm uses paths as basic building blocks for pattern construction: it starts with one-path graphs and combines them into 2-path, 3-path, etc. graphs. The combination technique does not use graph operations and is easy to implement. The path number of a graph is computed in linear time: it is the number of odd-degree vertices divided by two. Given a minimal path cover P, removal of one path creates a graph with a minimal path cover of size |P|−1, and there exist at least two paths in P whose removal leaves the graph connected.
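The path-number computation is a single pass over the degree sequence. A small sketch, assuming (as the slide's formula implies) a connected graph with at least two odd-degree vertices; an all-even connected graph is covered by one closed path, hence the max with 1.

```python
from collections import Counter

def path_number(edges):
    """Half the number of odd-degree vertices, per the slide's formula."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    odd = sum(1 for d in deg.values() if d % 2 == 1)
    return max(1, odd // 2)

# a path 1-2-3 plus a branch 2-4: four odd-degree vertices -> 2 paths
print(path_number([(1, 2), (2, 3), (2, 4)]))   # 2
```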
  • 79. More than one path cover per graph. 1. Define a descriptor for each path based on node labels and node degrees. 2. Use lexicographic order among descriptors to compare paths. 3. One graph can have several minimal path covers. 4. We only use path covers that are minimal w.r.t. the lexicographic order. 5. Removal of a path from a lexicographically minimal path cover leaves the cover lexicographically minimal.
  • 80. Example: path descriptors.
  P1 = v1, v2, v3, v4, v5; P2 = v1, v5, v2.
  Desc(P1) = <A,0,1>, <A,1,1>, <B,1,0>, <C,1,1>, <D,1,1>
  Desc(P2) = <A,0,1>, <A,1,0>, <B,1,1>
  • 81. Path mining algorithm. Phase 1: find all frequent 1-path graphs. Phase 2: find all frequent 2-path graphs by "joining" frequent 1-path graphs. Phase 3: find all frequent k-path graphs, k ≥ 3, by "joining" pairs of frequent (k-1)-path graphs. Main challenge: the "join" must ensure soundness and completeness of the algorithm.
  • 82. Graph as a collection of paths: table representation. A graph composed of 3 paths:
      Node  P1  P2  P3
      v1    a1  ⊥   ⊥
      v2    a2  b2  ⊥
      v3    a3  ⊥   ⊥
      v4    ⊥   b1  ⊥
      v5    ⊥   b3  c3
      v6    ⊥   ⊥   c1
      v7    ⊥   ⊥   c2
  • 83. Removing a path from the table: deleting P3 drops its column and the rows of the vertices that belonged only to P3 (v6, v7), leaving the table over P1 and P2.
  • 84. Joining graphs with common paths — the sum operation. Two graphs that share paths (here P1 and P2) are joined on the shared paths: the resulting table over P1, P2, P3, P4 merges the rows of both operands, keeping the common vertices once. (Figure: tables of the two operands and of their sum.)
  • 85. The sum operation: how it looks on graphs
  • 86. The sum is not enough: the splice operation. We need to construct a frequent n-path graph G on paths P1, …, Pn, and we have two frequent (n-1)-path graphs: G1 on paths P1, …, P(n-1) and G2 on paths P2, …, Pn. Their sum is an n-path graph G' on P1, …, Pn, and G' = G whenever P1 and Pn have no common node that belongs solely to them. Otherwise, since G is frequent, there exists a frequent 2-path graph H containing P1 and Pn exactly as they appear in G; joining the nodes of P1 and Pn in G' according to H recovers G. This is the splice operation!
  • 87. The splice operation: an example. (Figure: splicing G1 with G2 joins the shared nodes of their common paths, merging the two tables into the table of the resulting graph G3.)
  • 88. The splice operation: how it looks on graphs
  • 89. Labeled graphs: we mind the labels We join only nodes that have the same labels!
  • 90. Path mining algorithm.
  1. Find all frequent edges.
  2. Find frequent paths by adding one edge at a time (not all nodes are suitable for this!).
  3. Find all frequent 2-path graphs by exhaustive joining.
  4. Set k = 2.
  5. While frequent k-path graphs exist:
     a) Perform the sum operation on pairs of frequent k-path graphs where applicable.
     b) Perform the splice operation on the generated (k+1)-path candidates to get additional (k+1)-path candidates.
     c) Compute support for the (k+1)-path candidates.
     d) Eliminate non-frequent candidates, set k := k+1, and go to 5.
  • 91. Complexity: exponential, as the number of frequent patterns can be exponential in the size of the database (like any Apriori algorithm). Difficult (NP-hard) tasks: support computation, which consists of (a) finding all instances of a frequent pattern in the database (subgraph isomorphism) and (b) computing the MIS (maximum independent set size) of an instance graph. Relatively easy tasks: candidate-set generation, polynomial in the size of the frequent set from the previous iteration; and elimination of isomorphic candidate patterns — graph isomorphism computation is at worst exponential in the size of a pattern, not of the database.
  • 92. Additional Approaches for Single Graph Setting • BFS Approach – hSiGram • DFS Approach – vSiGram. M. Kuramochi and G. Karypis, Finding Frequent Patterns in a Large Sparse Graph, in Proc. of SIAM 2004. • Both use approximations of the MIS measure.
  • 97. Conclusions. The Data Mining field proved its practicality during its short lifetime with effective DM algorithms, with many applications in databases, chemistry & biology, networks, etc. Both the transaction and the single-graph settings are important. Graph Mining deals with designing effective algorithms for mining graph datasets, faces many hardness problems along the way, and is a fast-growing field with much unexplored potential. As more and more information is stored in complicated structures, we need to develop a new set of algorithms for graph data mining.
  • 98. Some References
  [1] A. Inokuchi, T. Washio and H. Motoda, An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data, Proceedings of the 4th PKDD'00, 2000, pages 13-23.
  [2] M. Kuramochi and G. Karypis, An Efficient Algorithm for Discovering Frequent Subgraphs, Tech. report, Department of Computer Science/Army HPC Research Center, 2002.
  [3] N. Vanetik, E. Gudes, and S. E. Shimony, Computing Frequent Graph Patterns from Semistructured Data, Proceedings of the 2002 IEEE ICDM'02.
  [4] X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining, Tech. report, University of Illinois at Urbana-Champaign, 2002.
  [5] J. Huan, W. Wang and J. Prins, Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism, Proceedings of the 3rd IEEE ICDM'03, p. 549.
  [6] M. Cohen and E. Gudes, Diagonally Subgraphs Pattern Mining, DMKD 2004, pages 51-58, 2004.
  [7] D. J. Cook and L. B. Holder, Graph-Based Data Mining, Tech. report, Department of CS Engineering, 1998.
  • 99. Document Classification: an alternative representation of multilingual web documents — the graph-based model. Introduced in A. Schenker, H. Bunke, M. Last, A. Kandel, Graph-Theoretic Techniques for Web Content Mining, World Scientific, 2005.
  • 100. The Graph-Based Model of Web Documents • Basic ideas: – One node for each unique term – If word B follows word A, there is an edge from A to B • In the presence of terminating punctuation marks (periods, question marks, and exclamation points) no edge is created between two words – Graph size is limited by including only the most frequent terms – Several variations for node and edge labeling (see the next slides) • Pre-processing steps – Stop words are removed – Lemmatization • Alternate forms of the same term (singular/plural, past/present/future tense, etc.) are conflated to the most frequently occurring form
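A minimal sketch of the construction (stop-word removal and lemmatization omitted, names illustrative): one node per kept term, and a directed edge whenever one kept word follows another inside the same sentence, so terminating punctuation never produces an edge.

```python
import re

def document_graph(text, max_terms=10):
    words = re.findall(r"[a-z]+", text.lower())
    # graph-size limit: keep only the most frequent terms
    keep = set(sorted(set(words), key=words.count, reverse=True)[:max_terms])
    edges = set()
    for sentence in re.split(r"[.!?]", text.lower()):
        toks = [w for w in re.findall(r"[a-z]+", sentence) if w in keep]
        edges.update(zip(toks, toks[1:]))   # word -> word that follows it
    return keep, edges

nodes, edges = document_graph("Yahoo news reports more. News service reports.")
print(sorted(edges))   # no edge crosses the sentence boundary
```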
  • 101. The standard representation. Edges are labeled according to the document section in which one word follows the other: Title (TI) covers the text of the document's title and any provided keywords (meta-data); Link (L) is the "anchor text" of clickable hyper-links in the document; Text (TX) comprises any visible text in the document (this includes anchor text but not title and keyword text). (Figure: a small example graph over the terms YAHOO, NEWS, SERVICE, REPORTS, MORE, REUTERS with TI, L, and TX edges.)
  • 102. Graph Based Document Representation – Detailed Example Source: www.cnn.com, May 24, 2005
  • 103. Graph Based Document Representation - Parsing title link text
  • 104. Standard graph-based document representation: the ten most frequent terms are used.
      Word           Frequency
      Iraqis         3
      Killing        2
      Bomb           2
      Wounding       2
      Driver         2
      Exploded       1
      Baghdad        1
      International  1
      CNN            1
      Car            1
  (Figure: the resulting graph — nodes CAR, KILLING, DRIVER, BOMB, EXPLODED, IRAQIS, BAGHDAD, WOUNDING, INTERNATIONAL, CNN connected by TX, L, and TI edges.)
  • 105. Classification using graphs • Basic idea: – Mine the frequent sub-graphs, call them terms – Use TFIDF for assigning the most characteristic terms to documents – Use Clustering and K-nearest neighbors classification
  • 106. Subgraph Extraction • Input – G, a training set of directed graphs with unique nodes – CRmin, the minimum classification rate • Output – the set of classification-relevant sub-graphs • Process: – for each class, find the subgraphs with CR > CRmin – combine all sub-graphs into one set • Basic assumption – classification-relevant sub-graphs are more frequent in a specific category than in other categories
  • 107. Computing the classification rate. Subgraph classification rate: CR(g'k(ci)) = SCF(g'k(ci)) × ISF(g'k(ci)), where SCF(g'k(ci)) is the Subgraph Class Frequency of subgraph g'k in category ci and ISF(g'k(ci)) is the Inverse Subgraph Frequency of g'k in category ci. A classification-relevant feature is a feature that best explains a specific category, i.e. is more frequent in this category than in all others.
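The slide does not spell out the SCF and ISF formulas, so the sketch below assumes TF-IDF-style definitions purely for illustration: SCF as the fraction of ci's documents that contain the subgraph, and ISF as a log inverse frequency over all documents.

```python
import math

def classification_rate(with_g_in_ci, docs_in_ci, with_g_total, docs_total):
    scf = with_g_in_ci / docs_in_ci                 # frequent inside ci -> high
    isf = math.log(docs_total / with_g_total)       # rare elsewhere     -> high
    return scf * isf

# frequent in the category (8/10) but rare overall (12/100): a relevant feature
print(classification_rate(8, 10, 12, 100))
```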
  • 108. k-Nearest Neighbors with graphs. (Figure: classification accuracy (70%–86%) versus the number of nearest neighbors k (1–10), comparing the vector models (cosine and Jaccard) with graph models of 40, 70, 100, and 150 nodes per graph.)

Editor's Notes

  1. Apriori:
     Step 1: Join two (k-1)-edge graphs (which share a common (k-2)-edge subgraph) to generate a k-edge graph.
     Step 2: Join the TID lists of the two (k-1)-edge graphs, then see whether the count is larger than the minimum support.
     Step 3: Check all (k-1)-subgraphs of the k-edge graph to see whether all of them are frequent.
     Step 4: After G passes Steps 1-3, compute the support of G in the graph dataset to see whether it is really frequent.
     gSpan:
     Step 1: Rightmost-extend a (k-1)-edge graph to several k-edge graphs.
     Step 2: Enumerate the occurrences of the (k-1)-edge graph in the graph dataset, meanwhile counting these k-edge graphs.
     Step 3: Output the k-edge graphs whose support is larger than the minimum support.
     Pros: (1) gSpan avoids the costly generation and testing of some infrequent candidate subgraphs; (2) there are no complicated graph operations, like joining two graphs or calculating all (k-1)-subgraphs; (3) gSpan is very simple. The key is how to do rightmost extension efficiently in a graph — the DFS code was invented for this.
  2. one node for each unique term: unique labeling of nodes – no two nodes have the same label
  3. The next few slides show a simple example of the graph document representation process, using a real CNN news web page downloaded on May 24, 2005. The picture is for aesthetic purposes only — pictures and other multimedia are not yet used for graph construction; only the text and some extracted HTML tag information are used.
  4. During the parsing stage, the sections relevant for this particular representation are recognized using the information provided by HTML tags: the title text, link text, and the document body (the normal text visible to the user).
  5. one node for each unique term: unique labeling of nodes – no two nodes have the same label
  6. The smarter method assumes that a feature relevant for classification is one that is frequent in one category and infrequent in the others — a better assumption than the one used in the naive case. We developed a measure called the Classification Rate that gives a higher weight to such subgraphs. The extraction process is quite similar to the naive process; CRmin is used as the threshold.
  7. Vector models are represented by dotted lines, graph models are represented by solid lines