RuleML2015: Learning Characteristic Rules in Geographic Information Systems

Learning Characteristic Rules in Geographic Information Systems
A. Salleb-Aouissi (1), C. Vrain (2), D. Cassard (3)
(1) CCLS - Columbia University - New York
(2) LIFO - Université d’Orléans - France
(3) French Geological Survey (BRGM)
RuleML 2015

Plan
1 Introduction
2 Distance-based characteristic rules
3 Experiments

The characterization task
Characterization: a descriptive data mining task
Given a target set of objects (denoted by X0)
⇒ find a description of these objects
X0 → p (measure)
A set of movies (for instance the movies produced by S. Spielberg)
Movie(Sp) → date ∈ [1974, 2010](86%)
Main advantages
focused on a set of positive examples
negative examples can be used to focus on important properties
⇒ Supervised Descriptive Rule Discovery: mining emerging patterns, subgroup discovery, contrast set mining
⇒ differs from discrimination and classification

Extension to relational databases [PKDD03]
An intermediate language based on existential and universal quantifiers
A set of movies (movies produced by S. Spielberg)
A relation between movies and awards
Movie(Sp) → ∃Award Award.kind in {Oscar, GoldenPalm}(25%)
Movie(Sp) → ∀Award Award.kind in {Oscar, GoldenPalm}(10%)
X0 → Q1 X1 . . . Qn Xn p
X0: the target objects
Xi: a type of objects
there exists a relation between Xi−1 and Xi
Qi = ∀ or ∃
The quantifier can be indexed by the name of the relation if needed.

Contributions
Extension of the work presented in [PKDD 03] for relational databases
⇒ Flexible quantifiers: ∃^e, ∀^f
Movie(Sp) → ∃^2 Actor Actor.nationality = French (xxx%)
Movie(Sp) → ∀^{20%} Actor Actor.nationality = French (xxx%)
⇒ Application to GIS: management of spatial data and spatial relations between objects
Introduction of distance-based relations for GIS → makes it possible to model spatial buffers around objects, as suggested in [PKDD 03]
Extension of the generality relation between rules (⪯ reads "is more general than")
People → ∃Movie ∃Award p ⪯ People → ∀Movie ∃Award p
∃^2_{10km} Fault ⪯ ∃^2_{5km} Fault ⪯ ∃^3_{3km} Fault
Experiments on GIS Andes with an interactive algorithm

Geographic Information Systems
A GIS handles geographic, spatially referenced data: each object has a position and a shape in space.
→ organization into thematic layers, linked by geography
→ description of the geographical objects by attribute-value tables
⇒ Experiments on a homogeneous GIS, a tool for mineral exploration and development
extending some 8,500 km, from the Guajira Peninsula (northern Colombia) to Cape Horn (Tierra del Fuego) → an area of 3.83 million km²
more than 70 thousand geographic objects
geographic, geologic, seismic, volcanic, mineralogical, gravimetric, . . . layers
mines, volcanoes, faults

Specification of the characterization task
Inputs
E: a set of geographic objects organized into layers
E = E1 ∪ E2 ∪ · · · ∪ En, where each Ei represents a set of objects with the same type Ti.
A set of attributes for each type of objects; objects are described by attribute-value pairs
Two kinds of relations between objects
classical relations between objects: intersect, overlap, . . .
distance-based relations r^λ_{ij} for each pair of object types Ei and Ej:
r^λ_{ij}(oi, oj) is true when d(oi, oj) ≤ λ
A measure: support, novelty, . . .
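The distance-based relation above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: objects are reduced to hypothetical (x, y) points with coordinates in km, and d is taken to be the Euclidean distance, whereas a real GIS would measure distances between arbitrary geometries. The names `distance`, `r_lambda` and `neighbours` are invented for this sketch.

```python
import math

def distance(a, b):
    """Euclidean distance between two (x, y) points (an assumption of this sketch)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def r_lambda(oi, oj, lam):
    """Distance relation: true when d(oi, oj) <= lam."""
    return distance(oi, oj) <= lam

def neighbours(o, layer, lam):
    """Objects of a layer linked to o by the distance relation with threshold lam."""
    return [oj for oj in layer if r_lambda(o, oj, lam)]

# Hypothetical toy data: one mine and three faults, coordinates in km.
mine = (0.0, 0.0)
faults = [(3.0, 4.0), (6.0, 8.0), (1.0, 1.0)]
print(neighbours(mine, faults, 5.0))
```

Calling `neighbours` with a given λ plays the role of a spatial buffer of radius λ around the object.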

Distance quantified paths
X0 − Q1 X1 . . . Qn Xn
where
n ≥ 0,
X0 represents the target set of objects to characterize,
for each i ≠ 0, Xi is a type of objects,
for each i ≠ 0, Qi is one of: ∀^f_{rij}, ∃^e_{rij}, ∀^f_λ, ∃^e_λ,
f is a percentage (f ≠ 0),
e is a natural number (e ≠ 0),
the index λ stands for the distance relation r^λ_{(i−1)i} between Xi−1 and Xi.
∀^{100%} (resp. ∃^1) stands for ∀ (resp. ∃).

Language of properties
Given for each type Ti,
a language Li specifying the properties that can be built
a boolean function V, determining for each object o of type Ti and
for each property p in Li whether Vp(o) = true or Vp(o) = false
A geographic characteristic rule on a target set X0
a conjunction of a distance quantified path δ and a property p
X0 − δ → p
Mines − ∃^3_{5km} Faults → True: there exist at least 3 faults within 5 km of a target object (a mineral deposit).
Mines − ∃^1_{1km} Volcano → (active=yes): there exists at least one active volcano within 1 km of a target object (a mineral deposit).

Generality order between paths
Let δ1 and δ2 be two distance quantified paths.
δ1 is more general than δ2 (δ1 ⪯ δ2) iff
length(δ1) = length(δ2),
δ1 and δ2 involve the same types of objects in the same order,
for 1 ≤ i ≤ length(δ1), either:
Q^1_i ≡ Q^2_i, or
Q^1_i = ∃_{rij} and Q^2_i = ∀_{rij}, or
Q^1_i = ∃_λ and Q^2_i = ∀_λ, or
Q^1_i = ∃^e_{rij} and Q^2_i = ∃^{e'}_{rij}, with e ≤ e', or
Q^1_i = ∃^e_λ and Q^2_i = ∃^{e'}_{λ'}, with λ ≥ λ' and e ≤ e'
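The cases listed above can be checked mechanically. The following sketch encodes only the distance-indexed quantifiers (not the named relations rij) and only the cases given on this slide. The `Step` record and both function names are hypothetical, introduced for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    kind: str     # "exists" or "forall"
    bound: float  # e (a count) for exists, f (a fraction in (0, 1]) for forall
    lam: float    # buffer distance, in km
    type_: str    # type of the objects reached by this step

def step_more_general(q1, q2):
    """True when q1 is more general than q2 according to the listed cases."""
    if q1.type_ != q2.type_:
        return False
    if q1 == q2:                                   # identical quantifiers
        return True
    if (q1.kind, q2.kind) == ("exists", "forall"):  # plain ∃_λ versus plain ∀_λ
        return q1.bound == 1 and q2.bound == 1.0 and q1.lam == q2.lam
    if q1.kind == q2.kind == "exists":             # larger buffer, smaller count
        return q1.lam >= q2.lam and q1.bound <= q2.bound
    return False

def path_more_general(d1, d2):
    """Paths are lists of Step; they must have the same length and types."""
    return len(d1) == len(d2) and all(
        step_more_general(a, b) for a, b in zip(d1, d2))

e2_10 = Step("exists", 2, 10.0, "Fault")
e2_5 = Step("exists", 2, 5.0, "Fault")
e3_3 = Step("exists", 3, 3.0, "Fault")
print(path_more_general([e2_10], [e2_5]), path_more_general([e2_5], [e3_3]))
print(path_more_general([e3_3], [e2_10]))
```

Under these cases two ∀ quantifiers with different percentages and buffers are not comparable, so the function returns False in both directions for them.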

Generality order between rules
δ1 → p1 is more general than δ2 → p2 (r1 ⪯ r2) iff
either δ1 ⪯ δ2 and p1 ⪯ p2,
or length(δ1) < length(δ2), δ1 is more general than the prefix of δ2 with length equal to length(δ1), and p1 = True.
∃^2_{10km} Fault ⪯ ∃^2_{5km} Fault ⪯ ∃^3_{3km} Fault
True is more general than ∃^2_{10km} Fault
We have ∀_{3km} Fault ⪯ ∀_{5km} Fault ⪯ ∀_{10km} Fault, but no relation between ∀^{40%}_{5km} Fault and ∀^{20%}_{10km} Fault.

Notion of coverage
Let o be an object and let δ → p be a rule.
δ is decomposed into Q_λ X.δ' and we consider the objects o1, . . . , on of type X within distance λ of o.
If n = 0 (no objects of type X within distance λ of o):
V_{∀^f_λ X.δ' → p}(o) = V_{∃^e_λ X.δ' → p}(o) = False
Otherwise:
V_{∀^f_λ X.δ' → p}(o) = True if |{oi | V_{δ' → p}(oi) = True}| / n ≥ f, False otherwise
V_{∃^e_λ X.δ' → p}(o) = True if |{oi | V_{δ' → p}(oi) = True}| ≥ e, False otherwise.
Note that
V_{∀_λ X.δ' → p}(o) = V_{δ' → p}(o1) ∧ · · · ∧ V_{δ' → p}(on)
V_{∃_λ X.δ' → p}(o) = V_{δ' → p}(o1) ∨ · · · ∨ V_{δ' → p}(on)
The same definition extends directly to a relation rij by considering the objects o1, . . . , on linked to o by the relation rij.
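The recursive definition of V above can be sketched as a short recursive function. This is an illustration under stated assumptions: objects are dicts with a hypothetical "xy" position in km, distance is Euclidean, a path is a list of (kind, bound, λ, type name) tuples, and the base case (an empty path evaluates the property on the object itself) is an assumption that the slide does not spell out.

```python
import math

def evaluate(o, path, prop, layers):
    """V for the rule (path -> prop) on object o."""
    if not path:                     # assumed base case: apply the property
        return prop(o)
    (kind, bound, lam, type_name), rest = path[0], path[1:]
    near = [x for x in layers[type_name]
            if math.dist(o["xy"], x["xy"]) <= lam]
    if not near:                     # n = 0: False for both quantifiers
        return False
    hits = sum(1 for x in near if evaluate(x, rest, prop, layers))
    if kind == "forall":
        return hits / len(near) >= bound   # bound is the fraction f
    return hits >= bound                    # bound is the count e

# Hypothetical toy layer: three faults, one mine at the origin.
layers = {"Fault": [{"xy": (0, 3), "active": True},
                    {"xy": (4, 0), "active": False},
                    {"xy": (30, 0), "active": True}]}
mine = {"xy": (0, 0)}
is_active = lambda fault: fault["active"]

print(evaluate(mine, [("exists", 2, 5.0, "Fault")], lambda f: True, layers))
print(evaluate(mine, [("forall", 0.5, 5.0, "Fault")], is_active, layers))
print(evaluate(mine, [("forall", 1.0, 5.0, "Fault")], is_active, layers))
```

Longer paths are handled by the recursion: each object reached at one step becomes the starting point for the rest of the path.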

Geographic Information Systems
Let Etarg be a given target set of objects
coverage(r, Etarg) = |{o | o ∈ Etarg, V_r(o) = true}| / |Etarg|
Proposition. Let r1 (δ1 → p1) and r2 (δ2 → p2) be two geographic rules; then
r1 ⪯ r2 ⇒ coverage(r1, Etarg) ≥ coverage(r2, Etarg)
(r1 ⪯ r2: r1 is more general than r2)
Corollary: if r1 ⪯ r2 and r1 is not frequent, then r2 is not frequent.
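As a small worked instance of the proposition and its corollary, the sketch below computes coverage for two rules where the first is more general than the second. The data are invented: each target is summarised by a hypothetical pre-computed number of faults within 5 km, and `min_cov` is an arbitrary threshold.

```python
def coverage(holds, targets):
    """coverage(r, Etarg): fraction of target objects o for which V_r(o) is true."""
    return sum(1 for o in targets if holds(o)) / len(targets)

# Hypothetical targets, each reduced to its number of faults within 5 km.
fault_counts = [0, 1, 2, 3, 5]
r1 = lambda n: n >= 2   # at least 2 faults within 5 km
r2 = lambda n: n >= 3   # at least 3 faults within 5 km (a specialization of r1)

min_cov = 0.7
cov1 = coverage(r1, fault_counts)
# r1 is more general than r2, so r2 only needs evaluating when r1 is frequent.
cov2 = coverage(r2, fault_counts) if cov1 >= min_cov else None
print(cov1, cov2)
```

Here r1 falls below the threshold, so r2 is skipped without being evaluated, which is the pruning that the corollary licenses.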

Link-coverage
Definition of the link-coverage of a rule r (δ → p):
L-coverage(r, Etarg) = coverage(open(δ) → True, Etarg)
where open(δ) is obtained by setting all the quantifiers of δ to ∃ (with no constraint on the number of elements).
Proposition:
For any threshold ε, if L-coverage(r, Etarg) ≤ ε then coverage(r, Etarg) ≤ ε
Corollary:
If open(δ) → True is not frequent, then none of its specializations is frequent.
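For a one-step path, link-coverage reduces to the fraction of targets that have at least one linked object, whatever its properties. The sketch below illustrates this on invented data: each target is represented by a hypothetical list of booleans, one per neighbour within the buffer, telling whether that neighbour satisfies p. All function names are assumptions of this sketch.

```python
def holds(kind, bound, flags):
    """One-step rule (Q X -> p) on a target whose neighbours have p-values `flags`."""
    if not flags:                  # n = 0: False for both quantifiers
        return False
    hits = sum(flags)
    if kind == "forall":
        return hits / len(flags) >= bound
    return hits >= bound

def coverage(kind, bound, targets):
    return sum(1 for flags in targets if holds(kind, bound, flags)) / len(targets)

def link_coverage(targets):
    """coverage(open(δ) -> True): at least one linked object, property ignored."""
    return sum(1 for flags in targets if len(flags) >= 1) / len(targets)

# Four hypothetical targets; the second one has no neighbour in the buffer.
targets = [[True, False], [], [False], [True, True, True]]
print(link_coverage(targets))
print(coverage("exists", 2, targets), coverage("forall", 0.5, targets))
```

Both rule coverages stay at or below the link-coverage, so a path whose link-coverage is already under the threshold can be discarded before any property is tested.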

SIGMiner
Input:
- Etarg, Ei, Pi, i ∈ {1..n}
- Rij: binary relations between Ei and Ej, i, j ∈ {1..n}
- MinCov
Output:
- A set of characterization rules R and a tree representing the rules.
QP = empty string, response = T
while response do
    Choose a quantifier q ∈ {∀, ∃}
    Choose a buffer λ or a relation ri,j
    Choose a parameter k for the quantifier
    Choose a set of objects Ej ∈ {Ei, i ∈ {1..n}}
    QP = QP.q^k_λ Ej
    if L-coverage(Etarg − QP → True, Etarg) ≥ MinCov then
        foreach property p ∈ Pj do
            if coverage(Etarg − QP → p, Etarg) ≥ MinCov then
                if interesting(Etarg − QP → p) then
                    R = R ∪ {Etarg − QP → p}
    if user no longer wishes to extend QP then
        response = F
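The loop above can be approximated in Python as follows. This is a non-interactive sketch, not the authors' system: the user's choices are replaced by a scripted list of steps, the `interesting` test is omitted (every frequent rule is kept), no tree is built, and objects are hypothetical dicts with an "xy" position in km.

```python
import math

def evaluate(o, path, prop, layers):
    """V for the rule (path -> prop) on object o (empty path: apply prop)."""
    if not path:
        return prop(o)
    (kind, bound, lam, tname), rest = path[0], path[1:]
    near = [x for x in layers[tname] if math.dist(o["xy"], x["xy"]) <= lam]
    if not near:
        return False
    hits = sum(1 for x in near if evaluate(x, rest, prop, layers))
    return hits / len(near) >= bound if kind == "forall" else hits >= bound

def coverage(path, prop, targets, layers):
    return sum(1 for o in targets if evaluate(o, path, prop, layers)) / len(targets)

def mine_rules(targets, layers, properties, choices, min_cov):
    """choices: scripted (kind, bound, lam, type_name) steps standing in for the user."""
    rules, path = [], []
    for step in choices:                      # each iteration extends QP by one step
        path = path + [step]
        opened = [("exists", 1, lam, t) for (_, _, lam, t) in path]
        if coverage(opened, lambda o: True, targets, layers) < min_cov:
            continue                          # link-coverage too low: test no property
        for name, prop in properties[step[3]].items():
            cov = coverage(path, prop, targets, layers)
            if cov >= min_cov:
                rules.append((list(path), name, cov))
    return rules

# Hypothetical toy data.
layers = {"Fault": [{"xy": (0, 3), "active": True},
                    {"xy": (4, 0), "active": False},
                    {"xy": (100, 3), "active": True}]}
mines = [{"xy": (0, 0)}, {"xy": (100, 0)}, {"xy": (50, 50)}]
properties = {"Fault": {"True": lambda f: True,
                        "active=yes": lambda f: f["active"]}}
rules = mine_rules(mines, layers, properties,
                   [("exists", 1, 5.0, "Fault")], min_cov=0.5)
for path, name, cov in rules:
    print(path, "->", name, round(cov, 2))
```

In the interactive setting the list of scripted steps would come from the user, one step per iteration, and the user would decide when to stop extending the path.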

GIS Andes
Figure: Database schema of GIS Andes. Links represent an “is_distant”
relationship.
Pre-computation of the distances between objects, given a large distance threshold
Pre-computation of relation tables between objects
Only rules with |novelty| ≥ 0.05 are kept.
novelty(r) = |{o | o ∈ Etarg, V_r(o) = true}| / |E| − (|Etarg| / |E|) · (|{o | o ∈ E, V_r(o) = true}| / |E|)
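The novelty measure above is easy to compute once V_r is available as a predicate. The sketch below uses invented data (objects are plain integers, and `holds` is a hypothetical stand-in for V_r) and assumes that Etarg is a subset of E.

```python
def novelty(holds, targets, everything):
    """|{o in Etarg : V_r(o)}| / |E|  -  (|Etarg| / |E|) * (|{o in E : V_r(o)}| / |E|)."""
    n = len(everything)
    in_target = sum(1 for o in targets if holds(o))
    overall = sum(1 for o in everything if holds(o))
    return in_target / n - (len(targets) / n) * (overall / n)

# Hypothetical data: 10 objects, 4 of them targets; the rule holds for
# 3 targets and 1 non-target.
everything = list(range(10))
targets = [0, 1, 2, 3]
holds = lambda o: o in {0, 1, 2, 7}

value = novelty(holds, targets, everything)
print(round(value, 2), abs(value) >= 0.05)
```

A value near zero means the rule holds about as often in the target set as chance would predict from the whole data set, which is why only rules with |novelty| ≥ 0.05 are kept.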

An example
Figure: Example of tree exploration in GISMiner.

Classical learned rules
Rule | Coverage
Mines → Mines.Era ∈ {Mesozoic, Cretaceous} | 4%
Mines → Mines.Era ∈ {Mesozoic, Jurassic, Cretaceous} | 6%
Mines → Mines.Lithology = sedimentary deposits | 5%
Mines → Mines.Lithology = volcanic deposits | 64%
Mines → Mines.Distance_Benioff ∈ [170..175] | 67%
Mines_gold → substance = Gold/Copper | 12%
Mines_gold → Country = Peru | 31%
Mines_gold → Country = Chile | 16%
Mines_gold → Country = Argentina | 22%
Mines_gold → Morphology = Present-day or recent placers | 16%
Mines_gold → Morphology = Discordant lode or vein (thickness > 50 cm), · · · | 30%
Mines_gold → Gitology = Alluvial-eluvial placers | 14%

More complex rules
Rule | Coverage
Mines_gold − ∃^1_{10km} Geology → True | 95%
Mines_gold − ∃^1_{10km} Geology → Geology.Age ∈ {Cenozoic, Tertiary} | 58%
Mines_gold − ∃^1_{10km} Geology → Geology.Age ∈ {Cenozoic, Quaternary} | 40%
Mines_gold − ∃^1_{10km} Geology → Geology.Age = Paleozoic | 38%
Mines_gold − ∃^1_{10km} Geology → Geology.System = Neogene | 41%
Mines_gold − ∃^1_{10km} Geology → Geology.GeolType = Sedimentary | 35%
Mines_gold − ∃^1_{15km} Faults → True | 63%
Mines_gold − ∃^2 …
Mines_gold − ∃^3 …
Mines_gold − ∀^{75%}_{10km} Geology ∃^1_{20km} Fault → True | 58%

Conclusion
Extension of the framework based on quantified paths
Introduction of distance-based relations for GIS
⇒ makes it possible to model spatial buffers around objects, as suggested in [PKDD 03]
Introduction of flexible operators ∃^e and ∀^f, allowing much more interesting rules
⇒ ∃^e is more interesting than ∀^f from the point of view of generality
An interactive algorithm for mining distance-based geographic rules.
In progress, an implementation of a relational rule mining system
performing a breadth-first search.
Interest of the formalism for learning in Description Logics?

Links with description logics
Let X0 − Q_{R0} X1 . . . Q_{Rn−1} Xn → p, we associate
the atomic concept Xi to each type of object Xi
the role Ri to each relation Ri linking Xi to Xi+1
the concept P to the property p
quantified path + property | representation in DL
∅ p | P
∀Xi p | Xi ∀Ri.P
∃Xi p | Xi ∃Ri.P
∃^e is a cardinality constraint.
