1. + Question Answering on Interlinked Data
Saeedeh Shekarpour, Axel-Cyrille Ngonga Ngomo, Soeren Auer
AKSW Research Group, Leipzig University
December 5 2013, IBM Research Center
3. + Motivation
Text queries (either keyword or natural language):
• Are a simple retrieval approach.
• Are popular.
• Have implicit and ambiguous semantics.
SPARQL queries require:
• Knowledge about the ontology.
• Proficiency in formulating formal queries.
• Explicit and unambiguous semantics.
AKSW group - Question Answering on Interlinked Data (published in WWW 2013)
4. + Comparison of Search Approaches
[Figure: comparison chart of search approaches. Labels: Data-Semantic aware, Data-Semantic unaware; Keyword-based query, Natural language query; Information Retrieval, Question Answering Systems; Our approach: SINA.]
5. + Example
Which television shows were created by Walt Disney?
select * where
{ ?v0 a dbo:TelevisionShow.
  ?v0 dbo:creator dbr:Walt_Disney. }
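The query on the slide omits its prefix declarations. The following is a minimal sketch of how the complete query text could be assembled, assuming the usual DBpedia namespaces for dbo: and dbr:. The helper name build_query is hypothetical and not part of SINA.

```python
# Usual DBpedia namespace bindings (assumed; the slide does not declare them).
PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "dbr": "http://dbpedia.org/resource/",
}

def build_query(triple_patterns):
    """Render (subject, predicate, object) patterns as a SELECT query."""
    header = "\n".join(f"PREFIX {p}: <{iri}>" for p, iri in PREFIXES.items())
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in triple_patterns)
    return f"{header}\nSELECT * WHERE {{\n{body}\n}}"

query = build_query([
    ("?v0", "a", "dbo:TelevisionShow"),
    ("?v0", "dbo:creator", "dbr:Walt_Disney"),
])
print(query)
```

The printed text can be sent to any SPARQL endpoint that hosts DBpedia data.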
6. + Aim and Challenges
Aim: Question answering over a set of interlinked data sources.
• Query segmentation.
• Resource disambiguation.
• Constructing a formal query (expressed in SPARQL).
7. + Further Challenges over Interlinked Data
1. Information for answering a certain question can be spread among different datasets employing heterogeneous schemas.
2. Constructing a federated formal query across different datasets requires exploiting links between the different datasets on both the schema and instance levels.
9. + Test bed datasets
* One single dataset: DBpedia.
* Three interlinked datasets from life science:
  ✓ Drugbank: a comprehensive knowledge base containing information about drugs, drug targets (i.e., proteins), interactions and enzymes.
  ✓ Diseasome: contains information about diseases and genes associated with these diseases.
  ✓ Sider: contains information about drugs and their side effects.
10. + Main characteristics of federated queries
1. Queries requiring fused information, e.g. side effects of drugs used for Tuberculosis.
2. Queries targeting combined information, e.g. side effects and enzymes of drugs used for Asthma.
3. Queries requiring keyword expansion, e.g. side effects of Valdecoxib.
[Figure: query graph spanning DrugBank, Sider and Diseasome. Labels: variables ?v0, ?v1, ?v2, ?v3; Drug, Disease, Side Effect, Enzymes, Asthma; edges a, enzyme, side effect, sameAs.]
11. + Challenge 1: Query Segmentation and Resource Disambiguation
• Sample question: What is the side effects of drugs used for Tuberculosis?
• Transformed to a 4-tuple: (side # effect # drug # Tuberculosis)
• Different segmentations are possible:
1. (side effect # drug # Tuberculosis)
2. (side effect drug # Tuberculosis)
• Mapping of the segments to the resources in the underlying knowledge bases.
Each valid segment
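To make the segmentation step concrete, here is a minimal sketch that enumerates every way of splitting the keyword tuple into contiguous segments. It assumes a non-empty keyword list; the function name segmentations is hypothetical, and the slides do not state that SINA enumerates segmentations this way.

```python
from itertools import product

def segmentations(keywords):
    """Yield every split of a non-empty keyword list into contiguous segments."""
    n = len(keywords)
    # Each of the n-1 gaps between keywords either is a boundary or is not.
    for cuts in product([False, True], repeat=n - 1):
        segments, current = [], [keywords[0]]
        for word, cut in zip(keywords[1:], cuts):
            if cut:
                segments.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        segments.append(" ".join(current))
        yield tuple(segments)

keywords = ["side", "effect", "drug", "Tuberculosis"]
all_segmentations = list(segmentations(keywords))
for seg in all_segmentations:
    print(" # ".join(seg))
```

With four keywords there are three gaps, so the sketch produces eight segmentations, including the two listed on the slide.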
12. Segment validation
✓ Original tuple: (side # effect # drug # Tuberculosis).
✓ Using a naive approach for finding all valid segments.
Valid Segments    Samples of Candidate Resources
side effect       1. sider:class:sideeffect  2. sider:property:side_effects
drug              1. drugbank:drugs  2. class:offer  3. sider:drugs  4. diseases:possibledrug
tuberculosis      1. diseases:1154  2. side_effects:C0041296
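A naive validation pass can be sketched as follows. It checks every contiguous keyword sub-sequence against a label index. The index here is a toy dictionary built from the candidate resources listed on the slide, and exact, case-insensitive label matching is an assumption; the slides do not specify the matching rule.

```python
# Toy label index (hypothetical): label -> candidate resources from the slide.
LABEL_INDEX = {
    "side effect": ["sider:class:sideeffect", "sider:property:side_effects"],
    "drug": ["drugbank:drugs", "class:offer", "sider:drugs",
             "diseases:possibledrug"],
    "tuberculosis": ["diseases:1154", "side_effects:C0041296"],
}

def valid_segments(keywords, index):
    """Return each contiguous sub-sequence that matches a label in the index."""
    found = {}
    n = len(keywords)
    for start in range(n):
        for end in range(start + 1, n + 1):
            segment = " ".join(keywords[start:end]).lower()
            if segment in index:  # assumed rule: exact label match
                found[segment] = index[segment]
    return found

result = valid_segments(["side", "effect", "drug", "Tuberculosis"], LABEL_INDEX)
for segment, resources in result.items():
    print(segment, "->", resources)
```

The pass inspects n(n+1)/2 sub-sequences for n keywords, which is why the slide calls the approach naive.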
14. Hidden Markov Model
• A statistical model containing a set of states.
• Moving from one state to another generates a sequence of observations.
• The probability of entering a state depends only on the previous state.
• The output is the most likely sequence of states generating the observed sequence.
15. State Space
• A state represents a knowledge base resource.
• The full state space contains all resources in the knowledge base.
• In practice, we prune the state space by excluding irrelevant states.
• We add an unknown entity state comprising all resources that are no longer available in the pruned state space.
• Extension of the state space with reasoning: the state space is extended with resources inferred from lightweight owl:sameAs reasoning.
16. Bootstrapping the Model Parameters
Emission Probability
• The set-similarity level measures the difference between the label and the segment in terms of the number of words using the Jaccard similarity.
• The string-similarity level measures the string similarity of each word in the segment with the most similar word in the label using the Levenshtein distance.
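The two similarity levels can be sketched as below. Jaccard similarity and Levenshtein distance are standard; normalizing the distance by the longer word and averaging over the segment words are assumptions made for illustration, as is the requirement that label and segment are non-empty. The slides do not say how the two levels are combined into the emission probability, so the sketch stops short of that.

```python
def jaccard(label, segment):
    """Jaccard similarity between the word sets of a label and a segment."""
    a, b = set(label.lower().split()), set(segment.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def levenshtein(s, t):
    """Edit distance via the standard dynamic program (two-row variant)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def string_similarity(label, segment):
    """Average, over segment words, of the best normalized match in the label."""
    label_words = label.lower().split()
    scores = []
    for word in segment.lower().split():
        best = max(
            1 - levenshtein(word, lw) / max(len(word), len(lw))
            for lw in label_words
        )
        scores.append(best)
    return sum(scores) / len(scores)

print(jaccard("side effects", "side effect"))
print(string_similarity("side effects", "side effect"))
```

The example shows why both levels are useful: the word sets of "side effects" and "side effect" overlap only partly, while the per-word string similarity stays high.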
17. Bootstrapping the Model Parameters
Transition Probability & Initial Probability
• The transition probability and initial probability are computed based on the semantic relatedness of two resources.
• Semantic relatedness is based on two values: distance and connectivity degree.
• We transform these two values to hub and authority values using the HITS algorithm.
• The initial probability and transition probability are defined as a uniform distribution over the hub and authority values.
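For readers unfamiliar with HITS, the following sketch shows the plain algorithm as power iteration on a small directed graph. It does not reproduce how SINA derives the graph from distance and connectivity degree, which the slides leave open; the toy edges and the function name hits are hypothetical.

```python
def hits(edges, iterations=50):
    """Plain HITS on a directed graph given as (source, target) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of n: sum of hub scores of the nodes pointing to n.
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub of n: sum of authority scores of the nodes n points to.
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Toy graph (hypothetical): two resources both link to dbo:creator.
edges = [("dbo:TelevisionShow", "dbo:creator"),
         ("dbr:Walt_Disney", "dbo:creator")]
hub, auth = hits(edges)
print(hub)
print(auth)
```

In this toy graph dbo:creator collects all the authority, while the two linking resources share the hub score equally.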
18. Evaluation of Bootstrapping
• We evaluated the accuracy of different distribution functions, i.e., normal, Zipfian and uniform distributions, for the transition probability.
• We ran the distribution functions with two different inputs: distance and connectivity degree values, and hub and authority values.
19. + Viterbi Algorithm
Aim: Find the most likely path of states generating the sequence of input keywords.
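A compact sketch of the Viterbi algorithm is given below. The states, observations and all probability values are toy numbers chosen for illustration and are not taken from SINA; missing dictionary entries are treated as probability zero.

```python
def viterbi(observations, states, initial, transition, emission):
    """Return (probability, path) of the most likely state sequence."""
    # best[s] = (probability, path) of the best path ending in state s.
    best = {s: (initial[s] * emission[s].get(observations[0], 0.0), [s])
            for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            prob, path = max(
                ((best[p][0] * transition[p].get(s, 0.0), best[p][1])
                 for p in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emission[s].get(obs, 0.0), path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])

# Toy model (hypothetical numbers).
states = ["dbo:TelevisionShow", "dbo:creator", "dbr:Walt_Disney"]
initial = {"dbo:TelevisionShow": 0.6, "dbo:creator": 0.2, "dbr:Walt_Disney": 0.2}
transition = {
    "dbo:TelevisionShow": {"dbo:creator": 0.8, "dbr:Walt_Disney": 0.2},
    "dbo:creator": {"dbr:Walt_Disney": 0.9, "dbo:TelevisionShow": 0.1},
    "dbr:Walt_Disney": {"dbo:creator": 0.5, "dbo:TelevisionShow": 0.5},
}
emission = {
    "dbo:TelevisionShow": {"television shows": 0.9},
    "dbo:creator": {"created": 0.8},
    "dbr:Walt_Disney": {"Walt Disney": 0.95},
}
prob, path = viterbi(["television shows", "created", "Walt Disney"],
                     states, initial, transition, emission)
print(prob, path)
```

The sketch keeps only the single best path per state. Producing a ranked list of paths would require keeping the top-k candidates per state instead.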
20. + Output of the HMM for the following query:
Which television shows were created by Walt Disney?
Probability    Path of states
0.0023         dbo:TelevisionShow, dbo:creator, dbr:Walt_Disney
0.0014         dbo:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
5.89E-4        dbr:TelevisionShow, dbo:creator, dbr:Walt_Disney
3.53E-4        dbr:TelevisionShow, dbo:creator, dbr:Category:Walt_Disney
3.76E-5        dbp:television, dbp:show, dbo:creator, dbr:Category:Walt_Disney
21. + Query Construction
22. Query Construction Method
Input: set of resources R = {r_1, r_2, ..., r_n}
Output: A query graph QG = (V, E), which is a directed, connected multi-graph.
Forward Chaining:
1. CT: Comprehensive type.
2. CD: Comprehensive domain.
3. CR: Comprehensive range.
23. Query Construction Method
Input: set of resources R = {r_1, r_2, ..., r_n}
Output: A query graph QG = (V, E), which is a directed, connected multi-graph.
Generating the Incomplete Query Graph (IQG)
Initializing vertices and primary edges.
• A vertex is added to the IQG (1) if r is an instance or (2) if r is a class.
• Properties are added along with zero, one or two vertices.
24. Query Construction Method
Example: What is the side effects of drugs used for Tuberculosis?
• diseasome:1154 (type instance)
• diseasome:possibleDrug (type property)
• sider:sideEffect (type property)
[Figure: the IQG consists of two disjoint sub-graphs, Graph 1 and Graph 2, over the vertices 1154, ?v0, ?v1 and ?v2 and the edges possibleDrug and sideEffect.]
25. Query Construction Method
Connecting Sub-graphs of an IQG:
1. Minimum spanning tree: a minimum set of edges (i.e., properties) to span a set of disjoint graphs.
2. Prim's algorithm: incrementally includes edges to connect disjoint sub-graphs.
• Direct properties: ?v0 ?p ?v1.
• Properties via owl:sameAs link.
(1) ?v0 owl:sameAs ?x. ?x ?p ?v1.
(2) ?v0 ?p ?x. ?x owl:sameAs ?v1.
(3) ?v0 owl:sameAs ?x. ?x ?p ?y. ?y owl:sameAs ?v1.
[Figure: Template 1 and Template 2, two ways of connecting the vertices 1154, ?v0, ?v1 and ?v2 via the possibleDrug and sideEffect edges.]
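The four connection patterns listed above (one direct, three via owl:sameAs) can be generated mechanically for any pair of vertices. The sketch below does that and wraps a pattern in a query that lists matching properties. Wrapping the pattern in a SELECT DISTINCT query is an assumption about how such patterns might be probed, and both function names are hypothetical.

```python
OWL_PREFIX = "PREFIX owl: <http://www.w3.org/2002/07/owl#>"

def connection_patterns(v_from, v_to, prop="?p"):
    """Return the direct pattern and the three owl:sameAs patterns."""
    return [
        f"{v_from} {prop} {v_to} .",
        f"{v_from} owl:sameAs ?x . ?x {prop} {v_to} .",
        f"{v_from} {prop} ?x . ?x owl:sameAs {v_to} .",
        f"{v_from} owl:sameAs ?x . ?x {prop} ?y . ?y owl:sameAs {v_to} .",
    ]

def probe_query(pattern):
    """Wrap a pattern in a query listing the properties that satisfy it."""
    return f"{OWL_PREFIX}\nSELECT DISTINCT ?p WHERE {{ {pattern} }}"

patterns = connection_patterns("?v0", "?v1")
for pattern in patterns:
    print(probe_query(pattern))
    print()
```

Trying the direct pattern first and falling back to the owl:sameAs variants keeps the connecting path as short as possible, in the spirit of the minimum spanning tree criterion.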
26. Evaluation
Goal of experiment: assess how well the following approaches perform:
1. Resource disambiguation.
2. Query construction.
Measurement of the performance:
1. Disambiguation: Mean Reciprocal Rank (MRR).
2. Query construction: precision and recall.
Benchmark:
1. Each benchmark entry pairs a natural-language query with the equivalent conjunctive SPARQL query.
2. 25 queries on the 3 interlinked datasets Drugbank, Sider and Diseasome.
3. QALD1 and QALD3 benchmarks for DBpedia.
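The three measures can be computed as in the sketch below, which uses the standard definitions. The sample inputs are made up for illustration and are not taken from the benchmark.

```python
def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the correct item per query (0 if it is absent)."""
    total = 0.0
    for candidates, correct in zip(ranked_lists, gold):
        if correct in candidates:
            total += 1.0 / (candidates.index(correct) + 1)
    return total / len(ranked_lists)

def precision_recall(returned, expected):
    """Precision and recall of a returned answer set against the expected one."""
    returned, expected = set(returned), set(expected)
    hits = len(returned & expected)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(expected) if expected else 0.0
    return precision, recall

# Made-up inputs: the correct resource is ranked 1st and 2nd, respectively.
mrr = mean_reciprocal_rank([["a", "b"], ["x", "y", "z"]], ["a", "y"])
precision, recall = precision_recall(["a", "b", "c"], ["a", "b", "d", "e"])
print(mrr, precision, recall)
```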
27. Evaluation using life-science datasets
Without reasoning: precision = 0.91, recall = 0.88
With reasoning: precision = 0.95, recall = 0.90
28. + Evaluation using DBpedia
• QALD3 Benchmark:
  ✓ Contains 100 questions.
  ✓ 32 original questions can be answered correctly.
• QALD1 Benchmark:
  ✓ Contains 50 questions.
  ✓ 7 complex questions.
  ✓ 13 questions requiring information beyond DBpedia, i.e., from YAGO and FOAF.
  ✓ 14 were slightly modified to remove expansion and cleaning problems.
  ✓ MRR of disambiguation = 96%
  ✓ Query construction accuracy = 83%
29. Runtime
Parallelization over three components:
1. Segment validation
2. Resource retrieval
3. Query construction
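The slides do not describe how the components are parallelized. As one possible illustration for the first component, independent segment checks can be distributed over a thread pool. The validation function below is a placeholder, not SINA's actual check.

```python
from concurrent.futures import ThreadPoolExecutor

def is_valid_segment(segment):
    # Placeholder check (hypothetical): a real system would query the
    # knowledge bases for resources whose labels match the segment.
    return segment.lower() in {"side effect", "drug", "tuberculosis"}

segments = ["side", "side effect", "effect drug", "drug", "Tuberculosis"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map preserves input order, so flags[i] belongs to segments[i].
    flags = list(pool.map(is_valid_segment, segments))
valid = [s for s, ok in zip(segments, flags) if ok]
print(valid)
```

Threads suit this step when each check is dominated by waiting on a remote endpoint; CPU-bound checks would call for processes instead.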
30. + Related work