Fosdem 2013 petra selmer flexible querying of graph data

Flexible querying of graph data

Graph processing room
FOSDEM, 2 Feb 2013

Petra Selmer
petra.selmer.uk@gmail.com
http://www.dcs.bbk.ac.uk/~lselm01/

Introduction

 I shall be presenting my PhD topic, which involves
a declarative query language allowing for the
flexible querying of graph-structured data with
complex paths.

2

Agenda

 Who (am I)?
 Why (the motivation)?
 Some background info
 What (is the query language and what
can it do)?
 Illustrative examples
 How (is it done)?

3

Who?

 Petra Selmer
 Part-time PhD student:
 Birkbeck College, University of London
 Prof. Alexandra Poulovassilis
 Dr. Peter T. Wood
 Software Architect:
 University College London’s Institute of Neurology
(Wellcome Trust Centre for Neuroimaging)

4

Why?

 Amount of graph-structured data is
growing fast
 The structure of this data is
becoming more complex, especially
when multiple, heterogeneous data
sources are integrated
 The structure of the data is also
always subject to change...

5

Why?
 Users of such systems may not be familiar with the underlying data
structure (available paths etc.)
 The user may not be able to obtain meaningful answers (or indeed,
any answers) from the data IF the querying system is limited to exact
matching of users’ queries
 Also, the user may wish to explore the data by starting from a set of
initial answers and proceeding from there
 The user may additionally wish to derive some intelligence from the
connections...

[Diagram: the data, the query, the user]

6

Background: Ontologies

 Currently part of the Semantic Web stack (Tim Berners-Lee,
RDF, triple stores)
 Models a domain of interest: inferences, reasoning...
 It can be thought of as a “schema” for graph data
 The following inference rules are included (among
others):
 Subclass: ‘History’, ‘Languages’ are subclasses of
‘Humanities’
 Subproperty, Domain, Range...

7

What?
 Data model: G = (V, E)
 Very general model
 V : vertices (or nodes); each labelled with some
constant
 E : directed, labelled edges; labels drawn from an
alphabet Σ ∪ {type}
 The query language is called Flex-It (it is
declarative)
 The basis is that of conjunctive regular path
queries
 There are two operators which may be applied to the
original query

8

What?
 Conjunctive regular path queries:
 The paths to be traversed in the graph are specified by a
regular expression
 A single regular path query conjunct: (X, R, Y)
 X, Y: either constants or variables
 R: the regular expression
 “Conjunctive”: joining multiple conjuncts; e.g. (X, R1, Y), (Y,
R2, Z), (Z, R3, A)
 The Y’s are matched, the Z’s are matched etc

Example graph: N1 -n-> N2 -n-> N3 -p-> N4
1) (N1, n+, ?Y): Y = N2, N3
2) (N1, n*p, ?Y): Y = N4
9
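
The two regular path queries on the example graph can be reproduced with a short Python sketch. It is only an illustration, not the Flex-It implementation: the automata are written out by hand rather than compiled from the regular expressions, and the names EDGES, NFA_N_PLUS, NFA_N_STAR_P and evaluate are assumptions made for this example.

```python
from collections import deque

# Example graph from the slide: N1 -n-> N2 -n-> N3 -p-> N4
EDGES = [("N1", "n", "N2"), ("N2", "n", "N3"), ("N3", "p", "N4")]

# Hand-built NFAs: (start state, accepting states, {(state, label): next states})
NFA_N_PLUS = (0, {1}, {(0, "n"): {1}, (1, "n"): {1}})      # n+
NFA_N_STAR_P = (0, {1}, {(0, "n"): {0}, (0, "p"): {1}})    # n*p


def evaluate(start_node, nfa, edges=EDGES):
    """Return every node Y reachable from start_node by a path whose labels the NFA accepts."""
    start_state, accepting, delta = nfa
    out = {}
    for src, label, dst in edges:
        out.setdefault(src, []).append((label, dst))
    # Breadth-first search over (graph node, NFA state) pairs.
    seen = {(start_node, start_state)}
    queue = deque(seen)
    while queue:
        node, state = queue.popleft()
        for label, next_node in out.get(node, []):
            for next_state in delta.get((state, label), ()):
                pair = (next_node, next_state)
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return {node for node, state in seen if state in accepting}


print(sorted(evaluate("N1", NFA_N_PLUS)))    # ['N2', 'N3']
print(sorted(evaluate("N1", NFA_N_STAR_P)))  # ['N4']
```

Searching over (node, state) pairs rather than over paths keeps the search finite even when the graph contains cycles.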

What?
 Approximation allows for the approximate matching
of labels in the path
 Edit operations are applied to the edge labels in
the paths denoted by the regular expression:
 Edit operations: insertions, deletions, inversions,
substitutions and transpositions of labels
 Each operation has a ‘cost’: usually 1
 Example:
 Query conjunct: (X, a*.b, Y)
 R = a*.b [answers returned at cost 0]
 R’ = p.a*.b (insertion of ‘p’) [answers returned at cost 1]
 R’’ = p.a*.b- (inversion of ‘b’) [answers returned at cost 2]

10
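
As a rough illustration of how edit costs accumulate, the sketch below computes the cheapest way for one path's label sequence to match a*.b when every edit costs 1. It makes several simplifying assumptions: the automaton is hand-built, transpositions are left out, an inversion such as b- is modelled as replacing b by a differently named label, and the names DELTA and approx_cost are invented for this example.

```python
import heapq

# Hand-built NFA for a*.b: state 0 loops on 'a' and moves to accepting state 1 on 'b'.
START, ACCEPTING = 0, {1}
DELTA = {(0, "a"): {0}, (0, "b"): {1}}


def approx_cost(labels, start=START, accepting=ACCEPTING, delta=DELTA):
    """Minimum total edit cost for `labels` to match the NFA (each edit costs 1)."""
    heap = [(0, 0, start)]  # (cost so far, labels consumed, NFA state)
    done = set()
    while heap:
        cost, i, state = heapq.heappop(heap)
        if (i, state) in done:
            continue
        done.add((i, state))
        if i == len(labels) and state in accepting:
            return cost
        moves = [(lab, q2) for (q, lab), targets in delta.items()
                 if q == state for q2 in targets]
        if i < len(labels):
            # A path label with no counterpart in the query (insertion into the query).
            heapq.heappush(heap, (cost + 1, i + 1, state))
            for lab, q2 in moves:
                # An exact match is free; any other label (e.g. the inverse 'b-') costs 1.
                step = 0 if lab == labels[i] else 1
                heapq.heappush(heap, (cost + step, i + 1, q2))
        for _, q2 in moves:
            # A query label with no counterpart in the path (deletion from the query).
            heapq.heappush(heap, (cost + 1, i, q2))
    return None  # no accepting state can be reached


print(approx_cost(["a", "b"]))            # 0
print(approx_cost(["p", "a", "a", "b"]))  # 1
print(approx_cost(["p", "a", "b-"]))      # 2
```

The three calls mirror R, R’ and R’’ from the slide: an exact match, one inserted label, and one inserted label plus one inverted label.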

What?
 Relaxation is applied by using inference
rules from an ontology (if one exists).
 Achieved by applying logical relaxation of the query
conditions using the data’s ontology definition
 Relaxation operations: subclass, subproperty, domain
and range
 Each operation has a ‘cost’ – usually 1
 Example:
 We have an ontology:
 Humanities (superclass)
 Languages and History (subclasses of Humanities)
 Assume our query states Languages may be relaxed
 Languages is relaxed to Humanities:
 Instances of Languages will be returned at cost 0
 Instances of History will be returned at cost 1

11
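
The subclass example can be sketched in a few lines of Python. The ontology fragment comes from the slide; the instance data (q1, q2, q3 and the class Physics) and the function names are invented for illustration, and only the subclass rule is modelled.

```python
# Ontology fragment from the slide: class -> direct superclass.
SUBCLASS_OF = {"Languages": "Humanities", "History": "Humanities"}

# Hypothetical instance data, invented for this example.
INSTANCE_TYPE = {"q1": "Languages", "q2": "History", "q3": "Physics"}


def ancestors(cls):
    """Return [cls, parent, grandparent, ...] by following subclass edges upwards."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def relax_type(cls):
    """Map each matching instance to the number of relaxation steps needed to reach it."""
    costs = {}
    for cost, target in enumerate(ancestors(cls)):
        for instance, instance_cls in INSTANCE_TYPE.items():
            # An instance matches if its class is the target or one of its subclasses.
            if target in ancestors(instance_cls) and instance not in costs:
                costs[instance] = cost
    return costs


print(relax_type("Languages"))  # {'q1': 0, 'q2': 1}
```

Each step up the class hierarchy adds 1 to the cost, so an instance keeps the cost of the first (least relaxed) class at which it matches.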

What?

 Answers are ranked according to how
closely they match the original query;
higher-cost answers have a lower ranking
 All answers at a certain distance d are
ranked the same and returned before
answers at a higher distance
 We allow for incremental execution: exact
answers returned first; then answers at
distance 1; ...
12
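
The ranking rule can be illustrated by grouping answers by distance. The answers and distances below are taken from the worked example later in the deck; the function name ranked_batches is an assumption. This sketch shows only the ordering: it sorts a complete answer list, whereas incremental execution would compute each batch on demand.

```python
from itertools import groupby

# (answer, distance) pairs taken from the worked example later in the deck.
ANSWERS = [
    (("ep23", "Journalist"), 2),
    (("ep22", "AirTravelAssistant"), 1),
    (("ep24", "AssistantEditor"), 2),
]


def ranked_batches(answers):
    """Yield (distance, answers at that distance) in order of increasing distance."""
    ordered = sorted(answers, key=lambda pair: pair[1])
    for distance, group in groupby(ordered, key=lambda pair: pair[1]):
        yield distance, [answer for answer, _ in group]


for distance, batch in ranked_batches(ANSWERS):
    print(distance, batch)
```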

Example – ‘Lifelong learner metadata’

[Figure: example data graph; the only labels surviving extraction are 'sc' and 'History']

13

 Query: “What work positions can I reach, having a degree in English”?
 Y = the episode; Z = the job
(?Y, ?Z) ←
(?X, type, University),
(?X, qualif.type, EnglishStudies),
(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)
15
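
One way to picture how the conjuncts combine is as a join on shared variables. In the sketch below the per-conjunct results are invented (the identifiers u1, u2, w1, w2, JobA and JobB are hypothetical and are not the data in the figure), and each conjunct is assumed to have been evaluated already.

```python
from functools import reduce

# Hypothetical bindings returned by each conjunct, in query order.
CONJUNCT_RESULTS = [
    [{"X": "u1"}, {"X": "u2"}],                                # (?X, type, University)
    [{"X": "u1"}],                                             # (?X, qualif.type, EnglishStudies)
    [{"X": "u1", "Y": "w1"}, {"X": "u2", "Y": "w2"}],          # (?X, prereq+, ?Y)
    [{"Y": "w1"}, {"Y": "w2"}],                                # (?Y, type, Work)
    [{"Y": "w1", "Z": "JobA"}, {"Y": "w2", "Z": "JobB"}],      # (?Y, job.type, ?Z)
]


def join(left, right):
    """Natural join: keep pairs of bindings that agree on every shared variable."""
    joined = []
    for a in left:
        for b in right:
            if all(a[v] == b[v] for v in a.keys() & b.keys()):
                joined.append({**a, **b})
    return joined


def answer(conjunct_results, head=("Y", "Z")):
    """Join all conjunct results, then project onto the head variables."""
    bindings = reduce(join, conjunct_results)
    return [tuple(b[v] for v in head) for b in bindings]


print(answer(CONJUNCT_RESULTS))  # [('w1', 'JobA')]
```

Here u2 is dropped because it has no EnglishStudies qualification, so only the episode reachable from u1 survives the join.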

 Query: “What work positions can I reach, having a degree in English”?
 Y = the episode; Z = the job
(?Y, ?Z) ←
(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)
 No results from User 2 will be returned... even though that user's timeline is relevant!
16

 Allowing query approximation can yield some answers:
 Replacing the edge label prereq by next, at an edit cost of 1, we get this variant of the
query:
(?Y, ?Z) ←
APPROX(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)
 prereq+ can be approximated by next.prereq* at edit distance 1:
 Result: Y = ep22, Z = AirTravelAssistant
17

 Allowing query approximation can yield some answers:
 Replacing the edge label prereq by next, at an edit cost of 1, we get this
variant of the query:
(?Y, ?Z) ←
APPROX(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)
 next.prereq* can be approximated by next.next.prereq*, now at edit distance 2:
 Results:
 Y = ep23, Z = Journalist
 Y = ep24, Z = AssistantEditor
18

 Query: “What jobs are open to me if I study English, or something similar, at University”?
(?Y, ?Z) ←
(?X, type, University), (?X, qualif, ?D),
RELAX (?D, type, EnglishStudies),
APPROX (?X, prereq+, ?Y),
(?Y, type, Work), (?Y, job.type, ?Z)
 In addition to the answers (from User 2) obtained by the previous query, we now also have
answers from the timeline of User 3
 prereq+ can be approximated by next.prereq* (distance 1) and EnglishStudies can be relaxed
(via Languages) to Humanities (distance 2), encompassing History
 Result: Y = ep32, Z = PersonalAssistant (distance of 3 from original query)
20

 Query: “What jobs are open to me if I study English, or something similar, at
University”?
(?Y, ?Z) ←
(?X, type, University), (?X, qualif, ?D),
RELAX (?D, type, EnglishStudies),
APPROX (?X, prereq+, ?Y),
(?Y, type, Work), (?Y, job.type, ?Z)
 next.prereq* can be approximated by next.next.prereq* (distance 2), with
EnglishStudies again relaxed to Humanities (distance 2)
 Results: (both at distance 4 from the original query)
 Y = ep33, Z = Author
 Y = ep34, Z = AssociateEditor
21

How?
 Theory
 Construction of a weighted non-deterministic finite
automaton (NFA) to represent the regular expression
 We add new states and transitions to the NFA to represent the
approximation and relaxation operations
 Formation of a product automaton: NFA with data
graph G
 We perform a lowest cost path traversal of the product
automaton; construct query tree, do joins etc
 Polynomial time complexity
 Correctness of algorithms proven

22
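
The lowest-cost traversal can be sketched as Dijkstra's algorithm over (graph node, automaton state) pairs, building the product automaton lazily. This is a minimal Python illustration under stated assumptions, not the C#/DEX prototype: the graph (nodes A to D) is invented, and the weighted automaton for APPROX(prereq+) is hand-built and allows only one edit, the substitution of prereq by next at cost 1.

```python
import heapq

# Hypothetical data graph, invented for this example.
GRAPH = [("A", "next", "B"), ("B", "prereq", "C"), ("B", "next", "D")]

# Hand-built weighted NFA for prereq+ : (state, label) -> [(next state, cost)].
# The 'next' transitions model substituting prereq by next at cost 1.
TRANSITIONS = {
    (0, "prereq"): [(1, 0)], (1, "prereq"): [(1, 0)],
    (0, "next"): [(1, 1)], (1, "next"): [(1, 1)],
}


def lowest_cost_answers(graph, transitions, start_node, start_state=0, accepting=(1,)):
    """Yield (node, cost) answers in order of non-decreasing cost."""
    out = {}
    for src, label, dst in graph:
        out.setdefault(src, []).append((label, dst))
    heap = [(0, start_node, start_state)]
    settled, reported = set(), set()
    while heap:
        cost, node, state = heapq.heappop(heap)
        if (node, state) in settled:
            continue
        settled.add((node, state))
        if state in accepting and node not in reported:
            reported.add(node)
            yield node, cost  # first time a node is settled in an accepting state
        for label, next_node in out.get(node, []):
            for next_state, weight in transitions.get((state, label), []):
                heapq.heappush(heap, (cost + weight, next_node, next_state))


print(list(lowest_cost_answers(GRAPH, TRANSITIONS, "A")))
# [('B', 1), ('C', 1), ('D', 2)]
```

Because the function is a generator, a caller can stop after the exact answers or after any chosen distance, which matches the incremental style of execution described earlier.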

How?

 Implementation of prototype
 Graph database: DEX (http://www.sparsity-technologies.com/dex)
 Programming language: C#
 Further work
 New flexible operation combining APPROX and
RELAX → FLEX
 Optimisation!

23

Any questions?

Thank you for your attention!

petra.selmer.uk@gmail.com
24
