These are the slides from a talk I presented in the Graph Processing room at FOSDEM 2013, in which I discussed my PhD topic: a query language allowing for the flexible querying of complex paths within graph-structured data.
1. Flexible querying of graph data
Graph processing room
FOSDEM, 2 Feb 2013
Petra Selmer
petra.selmer.uk@gmail.com
http://www.dcs.bbk.ac.uk/~lselm01/
2. Introduction
I shall be presenting my PhD topic which involves
a declarative query language allowing for the
flexible querying of graph-structured data with
complex paths.
3. Agenda
Who (am I)?
Why (the motivation)?
Some background info
What (is the query language and what
can it do)?
Illustrative examples
How (is it done)?
4. Who?
Petra Selmer
Part-time PhD student:
Birkbeck College, University of London
Prof. Alexandra Poulovassilis
Dr. Peter T. Wood
Software Architect:
University College London’s Institute of Neurology
(Wellcome Trust Centre for Neuroimaging)
5. Why?
Amount of graph-structured data is
growing fast
The structure of this data is
becoming more complex, especially
when multiple, heterogeneous data
sources are integrated together
The structure of the data is also
always subject to change...
6. Why?
Users of such systems may not be familiar with the underlying data
structure: available paths, etc.
The user may not be able to obtain meaningful answers (or indeed,
any answers) from the data if the querying system is limited to exact
matching of users’ queries
Also, the user may wish to explore the data by starting from a set of
initial answers and proceeding from there
The user may additionally wish to derive some intelligence from the
connections...
[Diagram: the user, the query, the data]
7. Background: Ontologies
Currently part of the Semantic Web stack (Tim Berners-Lee,
RDF, triple stores)
Models a domain of interest: inferences, reasoning...
It can be thought of as a “schema” for graph data
The following inference rules are included (among
others):
Subclass: ‘History’, ‘Languages’ are subclasses of
‘Humanities’
Subproperty, Domain, Range...
8. What?
Data model: G = (V, E)
Very general model
V : vertices (or nodes); each labelled with some
constant
E : directed, labelled edges; labels drawn from an
alphabet Σ ∪ {‘type’}
The query language is called Flex-It (it is
declarative)
The basis is that of conjunctive regular path
queries
There are two operators which may be applied to the
original query
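The data model above can be sketched in a few lines of Python. This is only an illustration, not the prototype's storage layer: the `Graph` class and its method names are hypothetical, and the sample edges reuse the N1..N4 labels from the next slide plus one made-up ‘type’ edge.

```python
class Graph:
    """Directed, edge-labelled graph G = (V, E)."""

    def __init__(self):
        self.vertices = set()
        self.out_edges = {}  # source vertex -> list of (label, target vertex)

    def add_edge(self, source, label, target):
        # Vertices are identified by their constant label (a plain string here).
        self.vertices.update((source, target))
        self.out_edges.setdefault(source, []).append((label, target))

    def neighbours(self, source, label):
        # All vertices reachable from `source` over one edge with this label.
        return [t for (l, t) in self.out_edges.get(source, []) if l == label]


g = Graph()
g.add_edge("N1", "n", "N2")
g.add_edge("N2", "n", "N3")
g.add_edge("N3", "p", "N4")
g.add_edge("N1", "type", "Episode")  # hypothetical class name, for illustration only
```

Edge labels come from Σ (here ‘n’ and ‘p’) or are the special label ‘type’, which links a vertex to its class.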
9. What?
Conjunctive regular path queries:
The paths to be traversed in the graph are specified by a
regular expression over edge labels
A single regular path query conjunct: (X, R, Y)
X, Y: either constants or variables
R: the regular expression
“Conjunctive”: joining multiple conjuncts; e.g. (X, R1, Y), (Y,
R2, Z), (Z, R3, A)
The Y’s are matched, the Z’s are matched etc
Example graph: N1 -n-> N2 -n-> N3 -p-> N4
1) (N1, n+, ?Y): Y = N2, N3
2) (N1, n*p, ?Y): Y = N4
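The two example conjuncts can be checked with a small sketch that walks the graph and an automaton for the regular expression in lockstep. It assumes the example graph is the chain N1 -n-> N2 -n-> N3 -p-> N4, and the two automata are written out by hand rather than compiled from the expressions; `eval_rpq` and the tuple layout are illustrative names, not part of Flex-It.

```python
from collections import deque

# Assumed example graph: N1 -n-> N2 -n-> N3 -p-> N4
EDGES = {"N1": [("n", "N2")], "N2": [("n", "N3")], "N3": [("p", "N4")], "N4": []}

# Hand-built NFAs: (transitions {state: [(label, next_state)]}, start state, final states)
N_PLUS = ({0: [("n", 1)], 1: [("n", 1)]}, 0, {1})          # n+
N_STAR_P = ({0: [("n", 0), ("p", 1)], 1: []}, 0, {1})      # n*p


def eval_rpq(edges, start_node, nfa):
    """Return every node Y such that some path start_node -> Y spells a word of the NFA."""
    transitions, start_state, finals = nfa
    seen = {(start_node, start_state)}
    queue = deque(seen)
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in finals:
            answers.add(node)
        # Follow a graph edge and an NFA transition together when their labels agree.
        for label, target in edges.get(node, []):
            for nfa_label, next_state in transitions.get(state, []):
                if label == nfa_label and (target, next_state) not in seen:
                    seen.add((target, next_state))
                    queue.append((target, next_state))
    return answers


print(sorted(eval_rpq(EDGES, "N1", N_PLUS)))    # ['N2', 'N3']
print(sorted(eval_rpq(EDGES, "N1", N_STAR_P)))  # ['N4']
```

A conjunctive query would evaluate each conjunct this way and then join the results on the shared variables.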
10. What?
Approximation allows for the approximate matching
of labels in the path
An edit operation is applied to each edge label in
the path denoted by the regular expression:
Edit operations: insertions, deletions, inversions,
substitutions and transpositions of labels
Each operation has a ‘cost’: usually 1
Example:
Query conjunct: (X, a*.b, Y)
R = a*.b [answers returned at cost 0]
R’ = p.a*.b (insertion of ‘p’) [answers returned at cost 1]
R’’ = p.a*.b- (inversion of ‘b’) [answers returned at cost 2]
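To make the cost arithmetic concrete, here is a simplified sketch that computes the cheapest sequence of edit operations between two label sequences. It is an assumption-laden simplification: the query side is one fixed word of the regular expression (for example a.a.b, a word of a*.b) rather than the expression itself, every operation costs 1, and an inverse label is written with a trailing ‘-’ as on the slide. The names `COSTS`, `inverse` and `approx_cost` are hypothetical.

```python
# Unit costs, as on the slide ("usually 1"); adjust per operation if needed.
COSTS = {"insertion": 1, "deletion": 1, "substitution": 1,
         "inversion": 1, "transposition": 1}


def inverse(label):
    """'b' <-> 'b-': the same edge label traversed in the opposite direction."""
    return label[:-1] if label.endswith("-") else label + "-"


def approx_cost(query, path):
    """Minimum total cost of edits turning the query label sequence into the path's."""
    m, n = len(query), len(path)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + COSTS["deletion"]
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + COSTS["insertion"]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            q, p = query[i - 1], path[j - 1]
            if q == p:
                change = 0
            elif p == inverse(q):
                change = min(COSTS["inversion"], COSTS["substitution"])
            else:
                change = COSTS["substitution"]
            best = min(d[i - 1][j] + COSTS["deletion"],
                       d[i][j - 1] + COSTS["insertion"],
                       d[i - 1][j - 1] + change)
            # Two adjacent labels swapped count as one transposition.
            if i > 1 and j > 1 and q == path[j - 2] and query[i - 2] == p:
                best = min(best, d[i - 2][j - 2] + COSTS["transposition"])
            d[i][j] = best
    return d[m][n]


print(approx_cost(["a", "a", "b"], ["a", "a", "b"]))        # 0
print(approx_cost(["a", "a", "b"], ["p", "a", "a", "b"]))   # 1
print(approx_cost(["a", "a", "b"], ["p", "a", "a", "b-"]))  # 2
```

The three calls mirror R, R’ and R’’ above: an exact match, one insertion of ‘p’, and an insertion plus an inversion of ‘b’.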
11. What?
Relaxation applies inference rules from an
ontology (if one exists).
The query conditions are logically relaxed using the
data’s ontology definition
Relaxation operations: subclass, subproperty, domain
and range
Each operation has a ‘cost’ – usually 1
Example:
We have an ontology:
Humanities (superclass)
Languages and History (subclasses of Humanities)
Assume our query states Languages may be relaxed
Languages is relaxed to Humanities:
Instances of Languages will be returned at cost 0
Instances of History will be returned at cost 1
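The subclass example can be sketched as follows, assuming a cost of 1 per step up the class hierarchy. Only the three class names come from the slide; the instance names and the helper functions are made up for illustration, and the other relaxation operations (subproperty, domain, range) are not covered.

```python
# Ontology from the slide: Languages and History are subclasses of Humanities.
SUBCLASS_OF = {"Languages": "Humanities", "History": "Humanities"}

# Hypothetical instances, invented for this sketch.
INSTANCE_OF = {"french101": "Languages", "tudors": "History"}


def relaxations(cls, step_cost=1):
    """Yield (class, cost) for cls itself and then each superclass in turn."""
    cost = 0
    while cls is not None:
        yield cls, cost
        cls = SUBCLASS_OF.get(cls)
        cost += step_cost


def is_subclass_or_same(cls, ancestor):
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def relaxed_instances(query_class):
    """Map each matching instance to the lowest relaxation cost at which it matches."""
    best = {}
    for cls, cost in relaxations(query_class):
        for instance, instance_class in INSTANCE_OF.items():
            if is_subclass_or_same(instance_class, cls) and instance not in best:
                best[instance] = cost
    return best


print(relaxed_instances("Languages"))  # {'french101': 0, 'tudors': 1}
```

The Languages instance is returned at cost 0 and the History instance at cost 1, once Languages has been relaxed to Humanities.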
12. What?
Answers are ranked according to how
closely they match the original query;
higher-cost answers have a lower ranking
All answers at a certain distance d are
ranked the same and returned before
answers at a higher distance
We allow for incremental execution: exact
answers returned first; then answers at
distance 1; ...
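A minimal sketch of the ranking rule: answers are grouped by distance and handed back one batch at a time, lowest distance first. It assumes the (distance, answer) pairs are already available, whereas an incremental evaluator would produce them lazily; `ranked_batches` is a hypothetical helper, and the sample answers are the ones reported later in this deck for the approximated query.

```python
from itertools import groupby


def ranked_batches(answers):
    """Yield (distance, [answers]) batches in order of increasing distance."""
    ordered = sorted(answers, key=lambda pair: pair[0])  # stable: ties keep input order
    for distance, group in groupby(ordered, key=lambda pair: pair[0]):
        yield distance, [answer for _, answer in group]


answers = [
    (2, ("ep23", "Journalist")),
    (1, ("ep22", "AirTravelAssistant")),
    (2, ("ep24", "AssistantEditor")),
]

for distance, batch in ranked_batches(answers):
    print(distance, batch)
# 1 [('ep22', 'AirTravelAssistant')]
# 2 [('ep23', 'Journalist'), ('ep24', 'AssistantEditor')]
```

Because the function is a generator, a caller can stop after the first batch if the answers at the lowest distance are enough.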
15. Query: “What work positions can I reach, having a degree in English?”
Y = the episode; Z = the job
(?Y, ?Z)
(?X, type, University),
(?X, qualif.type, EnglishStudies),
(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)
16. Query: “What work positions can I reach, having a degree in English?”
Y = the episode; Z = the job
(?Y, ?Z)
(?X, type, University),
(?X, qualif.type, EnglishStudies),
(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)
No results from User 2 will be returned...even though it is relevant!
17. Allowing query approximation can yield some answers:
Replacing the edge label prereq by next, at an edit cost of 1, we get this variant of the
query:
(?Y, ?Z)
(?X, type, University),
(?X, qualif.type, EnglishStudies),
APPROX(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)
prereq+ can be approximated by next.prereq* at edit distance 1:
Result: Y = ep22, Z = AirTravelAssistant
18. Allowing query approximation can yield some answers:
Replacing the edge label prereq by next, at an edit cost of 1, we get this
variant of the query:
(?Y, ?Z)
(?X, type, University),
(?X, qualif.type, EnglishStudies),
APPROX(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)
next.prereq* can be approximated by next.next.prereq*, now at edit distance 2:
Results:
Y = ep23, Z = Journalist
Y = ep24, Z = AssistantEditor
20. Query: “What jobs are open to me if I study English, or something similar, at University?”
(?Y, ?Z)
(?X, type, University), (?X, qualif, ?D),
RELAX (?D, type, EnglishStudies),
APPROX (?X, prereq+, ?Y),
(?Y, type, Work), (?Y, job.type, ?Z)
In addition to the answers (from User 2) obtained by the previous query, we now also have
answers from the timeline of User 3
prereq+ can be approximated by next.prereq* (distance 1) and EnglishStudies can be relaxed,
via Languages, to Humanities (distance 2), encompassing History
Result: Y = ep32, Z = PersonalAssistant (distance of 3 from original query)
21. Query: “What jobs are open to me if I study English, or something similar, at
University?”
(?Y, ?Z)
(?X, type, University), (?X, qualif, ?D),
RELAX (?D, type, EnglishStudies),
APPROX (?X, prereq+, ?Y),
(?Y, type, Work), (?Y, job.type, ?Z)
next.prereq* can be approximated by next.next.prereq* (distance 2), with
EnglishStudies again relaxed to Humanities (distance 2)
Results: (both at distance 4 from the original query)
Y = ep33, Z = Author
Y = ep34, Z = AssociateEditor
22. How?
Theory
Construction of a weighted non-deterministic finite
automaton (NFA) to represent the regular expression
We apply new states and transitions to the NFA to represent the
approximation and relaxation operations
Formation of a product automaton: NFA with data
graph G
We perform a lowest cost path traversal of the product
automaton; construct query tree, do joins etc
Polynomial time complexity
Correctness of algorithms proven
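The steps above can be sketched, under heavy simplification, as a Dijkstra-style traversal over pairs of (graph node, automaton state). This is not the prototype's algorithm: it covers only substitution, insertion and deletion at cost 1, leaves out relaxation, query trees and joins, and uses a hand-built automaton for n+ over the assumed example chain N1 -n-> N2 -n-> N3 -p-> N4. `ANY`, `EPS`, `approximate` and `lowest_cost_answers` are hypothetical names.

```python
import heapq

ANY = "*"    # wildcard: matches any edge label (hypothetical marker)
EPS = None   # epsilon: move in the automaton without following a graph edge

# Assumed example graph: N1 -n-> N2 -n-> N3 -p-> N4
EDGES = {"N1": [("n", "N2")], "N2": [("n", "N3")], "N3": [("p", "N4")], "N4": []}


def approximate(transitions):
    """Turn an exact NFA into a weighted one by adding cost-1 edit transitions."""
    weighted = {}
    for state, outgoing in transitions.items():
        new = []
        for label, target in outgoing:
            new.append((label, target, 0))  # exact match
            new.append((ANY, target, 1))    # substitution of the query label
            new.append((EPS, target, 1))    # deletion of the query label
        new.append((ANY, state, 1))         # insertion of an extra label
        weighted[state] = new
    return weighted


def lowest_cost_answers(edges, start_node, weighted, start_state, finals):
    """Traverse the product of graph and weighted NFA in order of increasing cost."""
    heap = [(0, start_node, start_state)]
    settled = set()
    answers = {}  # node -> lowest cost, inserted in non-decreasing cost order
    while heap:
        cost, node, state = heapq.heappop(heap)
        if (node, state) in settled:
            continue
        settled.add((node, state))
        if state in finals and node not in answers:
            answers[node] = cost
        for label, next_state, weight in weighted.get(state, []):
            if label is EPS:
                heapq.heappush(heap, (cost + weight, node, next_state))
                continue
            for edge_label, target in edges.get(node, []):
                if label == ANY or label == edge_label:
                    heapq.heappush(heap, (cost + weight, target, next_state))
    return answers


n_plus = approximate({0: [("n", 1)], 1: [("n", 1)]})  # weighted automaton for n+
print(lowest_cost_answers(EDGES, "N1", n_plus, 0, {1}))
```

For the conjunct (N1, n+, ?Y) this returns N2 and N3 at cost 0, then N4 at cost 1 (substituting ‘p’) and N1 itself at cost 1 (deleting the one mandatory ‘n’). Since each (node, state) pair is settled once, the work is bounded by the size of the product, in line with the polynomial bound stated above.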
23. How?
Implementation of prototype
Graph database: DEX (http://www.sparsity-technologies.com/dex)
Programming language: C#
Further work
New flexible operation, FLEX, combining APPROX
and RELAX
Optimisation!
24. Any questions?
Thank you for your attention!
petra.selmer.uk@gmail.com