We provide a general framework for learning characterization
rules of a set of objects in Geographic Information Systems (GIS) relying
on the definition of distance quantified paths. Such expressions specify
how to navigate between the different layers of the GIS starting from
the target set of objects to characterize. We have defined a generality
relation between quantified paths and proved that it is monotone with
respect to the notion of coverage, which allows us to develop an interactive
and effective algorithm to explore the search space of possible rules. We
describe GISMiner, an interactive system that we have developed based
on our framework. Finally, we present our experimental results from a
real GIS about mineral exploration.
RuleML2015: Learning Characteristic Rules in Geographic Information Systems
1. Learning Characteristic Rules in Geographic
Information Systems
A. Salleb-Aouissi (1), C. Vrain (2), D. Cassard (3)
(1) CCLS - Columbia University - New York
(2) LIFO - Université d’Orléans - France
(3) French Geological Survey (BRGM)
RuleML 2015
Salleb, Vrain,Cassard (CCLS,LIFO,BRGM) Rules learning RuleML 2015 1 / 24
4. The characterization task
Characterization: a descriptive data mining task
given a target set of objects (denoted by X0)
⇒
find a description of these objects
X0 → p (measure)
A set of movies (for instance the movies produced by S. Spielberg)
Movie(Sp) → date ∈ [1974, 2010](86%)
Main advantages
focused on a set of positive examples
negative examples can be used to focus on important properties
⇒ Supervised Descriptive Rule Discovery: emerging pattern mining,
subgroup discovery, contrast set mining
⇒ differs from discrimination and classification
5. Extension to relational databases [PKDD03]
An intermediate language based on existential and universal
quantifiers
A set of movies (movies produced by S. Spielberg)
A relation between movies and awards
Movie(Sp) → ∃Award Award.kind in {Oscar, GoldenPalm}(25%)
Movie(Sp) → ∀Award Award.kind in {Oscar, GoldenPalm}(10%)
X0 → Q1 X1 . . . Qn Xn p
X0: the target objects
Xi: a type of objects
there exists a relation between Xi−1 and Xi
Qi = ∀ or ∃
The quantifier can be indexed by the name of the relation if needed.
6. Contributions
Extension of the work presented in [PKDD 03] for relational
databases
⇒ Flexible quantifiers: ∃^e, ∀^f
Movie(Sp) → ∃^2 Actor Actor.nationality = French (xxx%)
Movie(Sp) → ∀^20% Actor Actor.nationality = French (xxx%)
⇒ Application to GIS: management of spatial data and spatial
relations between objects
Introduction of distance-based relations for GIS → makes it possible to model
spatial buffers around objects, as suggested in [PKDD 03]
Extension of the generality relation between rules
People → ∃Movie ∃Award p ⪰ People → ∀Movie ∃Award p
∃^2_10Km Fault ⪰ ∃^2_5Km Fault ⪰ ∃^3_3Km Fault
Experiments on GIS Andes with an interactive algorithm
7. Geographic Information Systems
A GIS handles geographic, spatially referenced data:
each object has a position and a shape in space.
→ organization into thematic layers, linked by
geography
→ descriptions of the geographical objects by
attribute-value tables
⇒ Experiments on a homogeneous GIS, a tool for mineral exploration and
development
extending some 8,500 km, from the Guajira Peninsula (northern
Colombia) to Cape Horn (Tierra del Fuego) → an area of 3.83 million km²
more than 70 thousand geographic objects
geographic, geologic, seismic, volcanic, mineralogical, gravimetric, . . . layers
mines, volcanoes, faults
9. Specification of the characterization task
Inputs
E: a set of geographic objects organized into layers
E = E1 ∪ E2 ∪ · · · ∪ En, where each Ei represents a set of objects with
the same type Ti.
A set of attributes for each type of objects; objects are described
by attribute-value pairs
Two kinds of relations between objects
classical relations between objects: intersect, overlap, . . .
distance-based relations r^λ_ij for each pair of object types Ei and Ej:
r^λ_ij(oi, oj) is true when d(oi, oj) ≤ λ
A measure: support, novelty, . . .
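The distance-based relation above can be sketched in Python. This is a minimal illustration only: it assumes objects are reduced to points with planar coordinates in km and that d is the Euclidean distance, and the names GeoObject, distance and r_lambda are hypothetical, not part of GISMiner.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class GeoObject:
    """Hypothetical point object: a layer name (its type) and planar coordinates in km."""
    layer: str
    x: float
    y: float


def distance(a: GeoObject, b: GeoObject) -> float:
    """Euclidean distance d(a, b); a real GIS would measure shape-to-shape distance."""
    return hypot(a.x - b.x, a.y - b.y)


def r_lambda(a: GeoObject, b: GeoObject, lam: float) -> bool:
    """Distance relation: true when d(a, b) <= lam."""
    return distance(a, b) <= lam


mine = GeoObject("Mines", 0.0, 0.0)
fault = GeoObject("Faults", 3.0, 4.0)  # made-up coordinates, 5 km from the mine
print(r_lambda(mine, fault, 5.0), r_lambda(mine, fault, 4.9))
```

Because the relation is parameterized by λ, the same pair of layers yields a whole family of relations, one per buffer size.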
10. Distance quantified paths
X0 − Q1 X1 . . . Qn Xn
where
n ≥ 0
X0 represents the target set of objects to characterize,
for each i ≠ 0, Xi is a type of objects,
for each i ≠ 0, Qi is one of: ∀^f_rij, ∃^e_rij, ∀^f_λ, ∃^e_λ
f is a percentage (f ≠ 0),
e is a natural number (e ≠ 0)
the indexation by λ stands for the distance relation r^λ_(i−1)i between
Xi−1 and Xi
∀100% (resp. ∃1) stands for ∀ (resp. ∃).
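One possible in-memory encoding of a distance quantified path is sketched below. The class Step, the helpers exists and forall, and the choice to store f as a fraction in (0, 1] are assumptions made for illustration; the slides do not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Step:
    """One step Q_i X_i of a distance quantified path (hypothetical encoding)."""
    kind: str                  # "exists" or "forall"
    param: Union[int, float]   # e for "exists", f as a fraction in (0, 1] for "forall"
    link: Union[float, str]    # distance threshold lambda (km) or a relation name
    target: str                # object type X_i

    def __post_init__(self):
        if self.kind == "exists":
            if not (isinstance(self.param, int) and self.param >= 1):
                raise ValueError("e must be a non-zero natural number")
        elif self.kind == "forall":
            if not (0 < self.param <= 1):
                raise ValueError("f must be a non-zero percentage")
        else:
            raise ValueError("kind must be 'exists' or 'forall'")


def exists(target: str, link, e: int = 1) -> Step:
    """∃^e; the default e = 1 is the plain ∃."""
    return Step("exists", e, link, target)


def forall(target: str, link, f: float = 1.0) -> Step:
    """∀^f; the default f = 100% is the plain ∀."""
    return Step("forall", f, link, target)


Path = Tuple[Step, ...]
path: Path = (exists("Faults", 5.0, e=3),)  # at least 3 faults within 5 km
```

The defaults mirror the convention that ∀^100% and ∃^1 stand for the classical quantifiers.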
11. Language of properties
Given for each type Ti,
a language Li specifying the properties that can be built
a boolean function V, determining for each object o of type Ti and
for each property p in Li whether Vp(o) = true or Vp(o) = false
A geographic characteristic rule on a target set X0
a conjunction of a distance quantified path δ and a property p
X0 − δ → p
Mines − ∃^3_5km Faults → True: there exist at least 3 faults within 5 km of
a target object (mineral deposit).
Mines − ∃^1_1km Volcano → (active=yes): there exists at least one active
volcano within 1 km of a target object (mineral deposit).
12. Generality order between paths
Let δ1 and δ2 be two distance quantified paths.
δ1 is more general than δ2 (δ1 ⪰ δ2) iff
length(δ1) = length(δ2)
δ1 and δ2 involve the same types of objects in the same order
for 1 ≤ i ≤ length(δ1), writing Q^1_i and Q^2_i for the i-th quantifiers of δ1 and δ2, either:
Q^1_i ≡ Q^2_i, or
Q^1_i = ∃_rij and Q^2_i = ∀_rij, or
Q^1_i = ∃_λ and Q^2_i = ∀_λ, or
Q^1_i = ∃^e_rij and Q^2_i = ∃^e'_rij, with e ≤ e', or
Q^1_i = ∃^e_λ and Q^2_i = ∃^e'_λ', with λ ≥ λ' and e ≤ e'
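The five cases listed above can be checked mechanically. The sketch below is an illustration under assumptions: a step is encoded as a hypothetical tuple (kind, param, link, target), and only the cases listed on this slide are implemented, so any other pair of quantifiers is reported as unrelated.

```python
from typing import Sequence, Tuple, Union

# Hypothetical encoding of a step Q_i X_i: kind is "exists" or "forall",
# param is e or f, link is a distance in km (float) or a relation name (str).
Step = Tuple[str, Union[int, float], Union[float, str], str]


def step_more_general(s1: Step, s2: Step) -> bool:
    """True if quantifier s1 is at least as general as s2 under the listed cases."""
    k1, p1, l1, t1 = s1
    k2, p2, l2, t2 = s2
    if t1 != t2:                                # same object type required
        return False
    if s1 == s2:                                # identical quantifiers
        return True
    if k1 == "exists" and k2 == "forall":       # plain ∃ versus plain ∀, same link
        return p1 == 1 and p2 == 1 and l1 == l2
    if k1 == "exists" and k2 == "exists":
        if isinstance(l1, str) or isinstance(l2, str):
            return l1 == l2 and p1 <= p2        # same relation, e <= e'
        return l1 >= l2 and p1 <= p2            # lambda >= lambda' and e <= e'
    return False


def path_more_general(d1: Sequence[Step], d2: Sequence[Step]) -> bool:
    """Same length, same types in the same order, step-wise generality."""
    return len(d1) == len(d2) and all(
        step_more_general(a, b) for a, b in zip(d1, d2)
    )


a = ("exists", 2, 10.0, "Fault")
b = ("exists", 2, 5.0, "Fault")
c = ("exists", 3, 3.0, "Fault")
print(path_more_general([a], [b]), path_more_general([b], [c]))
```

Widening the buffer or lowering the required count can only make a quantifier easier to satisfy, which is why both move toward generality.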
13. Generality order between rules
δ1 → p1 is more general than δ2 → p2 (r1 ⪰ r2) iff
either δ1 ⪰ δ2 and p1 ⪰ p2,
or length(δ1) < length(δ2), δ1 is more general than the prefix of δ2
with length equal to length(δ1), and p1 = True.
∃^2_10Km Fault ⪰ ∃^2_5Km Fault ⪰ ∃^3_3Km Fault
True is more general than ∃^2_10Km Fault
We have ∀_3Km Fault ⪰ ∀_5Km Fault ⪰ ∀_10Km Fault but no relation
between ∀^40%_5Km Fault and ∀^20%_10Km Fault.
14. Notion of coverage
Let o be an object and let δ → p be a rule.
δ is decomposed into Q_λ X.δ' and we consider the objects o1, . . ., on of
type X within distance λ of o.
If n = 0 (no objects of type X within distance λ of o):
V_{∀^f_λ X.δ' → p}(o) = V_{∃^e_λ X.δ' → p}(o) = False
Otherwise:
V_{∀^f_λ X.δ' → p}(o) = True if |{oi | V_{δ' → p}(oi) = True}| / n ≥ f, False otherwise
V_{∃^e_λ X.δ' → p}(o) = True if |{oi | V_{δ' → p}(oi) = True}| ≥ e, False otherwise.
Let us notice that
V_{∀_λ X.δ' → p}(o) = V_{δ' → p}(o1) ∧ · · · ∧ V_{δ' → p}(on)
V_{∃_λ X.δ' → p}(o) = V_{δ' → p}(o1) ∨ · · · ∨ V_{δ' → p}(on)
The same definition easily extends to a relation rij by considering
the objects o1, . . . on linked to o by the relation rij.
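The recursive definition of V lends itself to a direct sketch. Everything below is illustrative: the tuple encoding of steps, the function names holds and neighbours, and the toy one-dimensional data are assumptions, not the GISMiner implementation.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Step = Tuple[str, float, float, str]  # (kind, e or f, lambda in km, target type)
Neighbours = Callable[[Hashable, float, str], List[Hashable]]


def holds(o, path: Sequence[Step], prop: Callable[[Hashable], bool],
          neighbours: Neighbours) -> bool:
    """Evaluate V_{path -> prop}(o) by peeling off the first quantifier."""
    if not path:                       # empty path: test the property on o itself
        return prop(o)
    (kind, param, lam, target), rest = path[0], path[1:]
    near = neighbours(o, lam, target)
    if not near:                       # n = 0: both quantifiers evaluate to False
        return False
    hits = sum(holds(oi, rest, prop, neighbours) for oi in near)
    if kind == "forall":
        return hits / len(near) >= param   # proportion of satisfying neighbours
    if kind == "exists":
        return hits >= param               # number of satisfying neighbours
    raise ValueError(f"unknown quantifier kind: {kind}")


# Toy data: objects on a line, positions in km (made up for illustration).
POS = {"m1": 0.0, "m2": 50.0, "f1": 1.0, "f2": 3.0, "f3": 4.5, "f4": 52.0}
TYPE = {"m1": "Mines", "m2": "Mines", "f1": "Faults", "f2": "Faults",
        "f3": "Faults", "f4": "Faults"}


def neighbours(o, lam, target):
    return [x for x in POS
            if x != o and TYPE[x] == target and abs(POS[x] - POS[o]) <= lam]


always = lambda o: True
rule_path = [("exists", 3, 5.0, "Faults")]       # ∃^3_5km Faults → True
print(holds("m1", rule_path, always, neighbours))  # m1 has three faults within 5 km
print(holds("m2", rule_path, always, neighbours))  # m2 has only one
```

Passing neighbours as a callable keeps the evaluation independent of how distances are stored, for example in pre-computed relation tables.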
15. Geographic Information Systems
Let Etarg be a given target set of objects
coverage(r, Etarg) = |{o | o ∈ Etarg, Vr(o) = true}| / |Etarg|
Proposition. Let r1 (δ1 → p1) and r2 (δ2 → p2) be two geographic rules,
then
r1 ⪰ r2 ⇒ coverage(r1, Etarg) ≥ coverage(r2, Etarg)
Corollary: If r1 is not frequent, then r2 is not frequent.
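Coverage and the monotonicity property can be illustrated with a small sketch. The function coverage takes any predicate standing for Vr; the fault counts are invented for the example and the two rules are assumed to be ∃^2_5km Faults → True and its specialization ∃^3_5km Faults → True.

```python
from typing import Callable, Hashable, Iterable


def coverage(rule_holds: Callable[[Hashable], bool],
             targets: Iterable[Hashable]) -> float:
    """Fraction of target objects o for which V_r(o) is true."""
    targets = list(targets)
    if not targets:
        raise ValueError("Etarg must not be empty")
    return sum(1 for o in targets if rule_holds(o)) / len(targets)


# Invented data: number of faults within 5 km of each mine.
faults_within_5km = {"m1": 3, "m2": 1, "m3": 2, "m4": 0}
etarg = list(faults_within_5km)

r1 = lambda o: faults_within_5km[o] >= 2   # more general rule
r2 = lambda o: faults_within_5km[o] >= 3   # more specific rule

print(coverage(r1, etarg), coverage(r2, etarg))
```

If the coverage of r1 were already below the minimum threshold, r2 would not need to be evaluated at all, which is what makes the search space prunable.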
16. Link-coverage
Definition of the link-coverage of a rule r (δ → p):
L-coverage(r, Etarg) = coverage(open(δ) → True, Etarg)
where open(δ) is obtained by setting all the quantifiers of δ to ∃ (with
no constraint on the number of elements).
Proposition:
If L-coverage(r, Etarg) ≤ ε for a threshold ε, then coverage(r, Etarg) ≤ ε
Corollary:
If open(δ) → True is not frequent, then none of its specializations is
frequent.
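The open(δ) operation and the resulting pruning test can be sketched as follows, assuming the same hypothetical tuple encoding (kind, param, link, target) for a step; the names open_path and can_prune are illustrative.

```python
def open_path(path):
    """open(δ): replace every quantifier by a plain ∃ (e = 1), keeping links and types."""
    return [("exists", 1, link, target) for (_kind, _param, link, target) in path]


def can_prune(l_coverage: float, min_cov: float) -> bool:
    """A path whose link-coverage is below MinCov cannot yield a frequent rule."""
    return l_coverage < min_cov


delta = [("forall", 0.4, 5.0, "Faults"), ("exists", 3, "intersect", "Volcano")]
print(open_path(delta))
```

The link-coverage only asks whether the linked objects exist at all, so it is cheap to compute before any property is tested.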
17. GISMiner
Input:
- Etarg, Ei , Pi , i ∈ {1..n}
- Rij binary relations between Ei and Ej , i, j ∈ {1..n}
- MinCov.
Output:
- A set of characterization rules R and a tree representing the rules.
QP = empty string, R = ∅, response = T
while response do
    Choose a quantifier q ∈ {∀, ∃}
    Choose a buffer λ or a relation ri,j
    Choose a parameter k for the quantifier
    Choose a set of objects Ej ∈ {Ei, i ∈ {1..n}}
    QP = QP.Q^k_λ Ej
    if L-coverage(Etarg − QP → True, Etarg) ≥ MinCov then
        foreach property p ∈ Pj do
            if coverage(Etarg − QP → p, Etarg) ≥ MinCov then
                if interesting(Etarg − QP → p) then
                    R = R ∪ {Etarg − QP → p}
    if user no longer wishes to extend QP then
        response = F
19. GIS Andes
Figure: Database schema of GIS Andes. Links represent an “is_distant”
relationship.
Pre-computation of the distances between objects, up to a large
distance threshold
Pre-computation of relation tables between objects
Only rules with |novelty| ≥ 0.05 are kept.
novelty(r) = |{o | o ∈ Etarg, Vr(o) = true}| / |E| − (|Etarg| / |E|) · (|{o | o ∈ E, Vr(o) = true}| / |E|)
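The novelty filter can be written directly from the four counts in the formula. The function names and the numbers in the example are made up for illustration; only the 0.05 threshold comes from the slide.

```python
def novelty(n_target_true: int, n_target: int, n_all_true: int, n_all: int) -> float:
    """novelty(r) from counts: |Etarg ∩ r| / |E| - (|Etarg| / |E|) * (|r| / |E|)."""
    if n_all <= 0:
        raise ValueError("E must not be empty")
    return n_target_true / n_all - (n_target / n_all) * (n_all_true / n_all)


def keep(n_target_true: int, n_target: int, n_all_true: int, n_all: int,
         threshold: float = 0.05) -> bool:
    """Keep a rule only if |novelty| reaches the threshold."""
    return abs(novelty(n_target_true, n_target, n_all_true, n_all)) >= threshold


# Invented counts: 1000 objects, 100 targets, rule true for 80 targets and 200 objects.
print(novelty(80, 100, 200, 1000), keep(80, 100, 200, 1000))
```

A novelty close to zero means the rule holds for target objects about as often as chance would predict, so it says little about the target set.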
20. An example
Figure: Example of tree exploration in GISMiner.
23. Conclusion
Extension of the framework based on quantified paths
Introduction of distance-based relations for GIS
⇒ makes it possible to model spatial buffers around objects, as suggested in
[PKDD 03]
Introduction of flexible operators ∃^e and ∀^f, allowing much more
interesting rules
⇒ ∃^e is more interesting than ∀^f from the point of view of generality
An interactive algorithm for mining distance-based geographic
rules.
In progress: an implementation of a relational rule mining system
performing a breadth-first search.
Interest of the formalism for learning in Description Logics?
24. Links with description logics
Given a rule X0 − Q_R0 X1 . . . Q_Rn−1 Xn → p, we associate
the atomic concept Xi to each type of object Xi
the role Ri to each relation Ri linking Xi to Xi+1
the concept P to the property p
quantified path + property    representation in DL
∅ p                           P
∀Xi p                         Xi ∀Ri.P
∃Xi p                         Xi ∃Ri.P
∃^e is a cardinality constraint.