Using Knowledge Graph for Promoting Cognitive Computing
1. Using Knowledge Graph for Promoting Cognitive Computing
Presenter: Dr. Saeedeh Shekarpour
2/10/2017
1
2. About me
Education
• 2010-2013: PhD student, AKSW Research Group, Leipzig University, Germany
• 2014-2015: PhD/Postdoc, EIS Research Group, Bonn University, Germany
• 2016-present: Postdoc, Kno.e.sis Center, USA
3. About me
Research Interest
6+ years of research experience in the following directions:
• Previously:
• Question Answering Systems, Semantic Search.
• Linked Data and Semantic Web Technologies.
• Statistical classifier models (e.g. HMM).
• Ontology Development.
• Natural Language Processing.
• Currently:
• Information Extraction and Knowledge Graph Creation.
• Mining Social Networks.
• Gaining experience with Deep Learning.
4. About me
Selected Publications
• Saeedeh Shekarpour, Edgard Marx, Sören Auer, Amit Sheth:
RQUERY: Rewriting Natural Language Queries on Knowledge Graphs to
Alleviate the Vocabulary Mismatch Problem. AAAI 2017
• Saeedeh Shekarpour, Axel-Cyrille Ngonga Ngomo, Sören Auer:
Question answering on interlinked data. WWW 2013: 1145-1156
• Andreas Both, Dennis Diefenbach, Kuldeep Singh, Saeedeh Shekarpour,
Didier Cherix, Christoph Lange: Qanary - A Methodology for Vocabulary-
Driven Open Question Answering Systems. ESWC2016: 625-641
• Saeedeh Shekarpour, Sören Auer, Axel-Cyrille Ngonga Ngomo, Daniel
Gerber, Sebastian Hellmann, Claus Stadler: Keyword-Driven SPARQL Query
Generation Leveraging Background Knowledge. Web Intelligence 2011:
203-210
5. About me
Selected Publications
• Saeedeh Shekarpour, Konrad Höffner, Jens Lehmann, Sören Auer: Keyword
Query Expansion on Linked Data Using Linguistic and Semantic Features.
ICSC 2013: 191-197
• Saeedeh Shekarpour, Edgard Marx, Axel-Cyrille Ngonga Ngomo, Sören Auer:
SINA: Semantic interpretation of user queries for question answering on
interlinked data. J. Web Sem. 2015
6. Outline
Introduction
Part 1: Vision
Advantages of using Knowledge Graph in
Question Answering
Machine Learning
NLP
Information Retrieval
Part 2: Research in depth
RQUERY: Rewriting Natural Language Queries on Knowledge Graphs
to Alleviate the Vocabulary Mismatch Problem
HeadEX: Triple Extraction from Stream of News Headlines on Twitter
using n-ary Relations
7. Prevalence of using KG
• Google knowledge graph
• IBM Watson
• Using knowledge graph in smart phone
Google Now
8. The growth of Linked Open Data
EIS research group - Bonn University
May 2007: 12 datasets
January 2017: 2,973 datasets, more than 140 billion triples
9. Outline
Introduction
Part 1: Vision
Advantages of using Knowledge Graph in
Question Answering
Machine Learning
NLP
Information Retrieval
Part 2: Research in depth
RQUERY: Rewriting Natural Language Queries on Knowledge Graphs
to Alleviate the Vocabulary Mismatch Problem
HeadEX: Triple Extraction from Stream of News Headlines on Twitter
using n-ary Relations
11. Objective: Transformation from Textual Query to Formal Query
Which television shows were created by Walt Disney?

SELECT * WHERE
{ ?v0 a dbo:TelevisionShow .
  ?v0 dbo:creator dbr:Walt_Disney . }
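The transformation above can be illustrated with a toy in-memory triple matcher. This is a minimal sketch, not a real SPARQL engine: the dataset triples are hypothetical, names like `dbo:TelevisionShow` are kept as plain strings, and the Turtle shorthand `a` is written out as `rdf:type`.

```python
# Toy in-memory triple store illustrating how the formal query above is
# evaluated: terms starting with '?' are variables that bind to matching values.
TRIPLES = [
    ("dbr:Zorro", "rdf:type", "dbo:TelevisionShow"),   # hypothetical data
    ("dbr:Zorro", "dbo:creator", "dbr:Walt_Disney"),
    ("dbr:Fantasia", "rdf:type", "dbo:Film"),
]

def match(pattern, triple, binding):
    """Try to unify one triple pattern with one triple; return the extended binding."""
    b = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):          # variable: bind it, or check consistency
            if b.get(p, t) != t:
                return None
            b[p] = t
        elif p != t:                   # constant must match exactly
            return None
    return b

def query(patterns):
    """Evaluate a conjunction of triple patterns (a basic graph pattern)."""
    bindings = [{}]
    for pat in patterns:
        bindings = [b2 for b in bindings for t in TRIPLES
                    if (b2 := match(pat, t, b)) is not None]
    return bindings

# SELECT * WHERE { ?v0 a dbo:TelevisionShow . ?v0 dbo:creator dbr:Walt_Disney . }
print(query([("?v0", "rdf:type", "dbo:TelevisionShow"),
             ("?v0", "dbo:creator", "dbr:Walt_Disney")]))
# → [{'?v0': 'dbr:Zorro'}]
```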
13. How can KG facilitate exploiting answers from several sources?
• Using interlinked datasets enables exploiting information
that is spread across diverse datasets.
• Horizontal search is applicable; decomposing the question is not
necessary.
[Figure 1: Schema interlinking for three datasets, i.e. DrugBank, Sider, Diseasome. Figure 2: Resources from the three different datasets joined to answer the query below.]
Query: What are the side effects of drugs used for Tuberculosis?
Saeedeh Shekarpour, Axel-Cyrille Ngonga Ngomo, Sören Auer: Question answering on
interlinked data. WWW 2013: 1145-1156
14. How can KG benefit machine learning approaches?
The structure and semantics of the data can be employed as
emerging features in machine learning approaches.
• Structural features are mainly graph-based parameters such as:
Paths between entities
Popularity degree:
frequency
in-degree
out-degree
Cliques on the graph
15. How can KG benefit machine learning approaches?
• Semantic features include:
Schema-aware features:
Hierarchy of concepts
Labels of properties
Direction of properties
Domain and range of properties
Aligning ontologies and vocabularies across various domains
Data-driven features:
Types of entities
Traversing owl:sameAs links
16. Query Expansion Task
Linguistic vs. Semantic Features for the Query Expansion Task
• Linguistic features from WordNet:
Synonyms: words having a similar meaning.
Hyponyms: words representing a specialization of the input.
Hypernyms: words representing a generalization of the input.
• Semantic features from Linked Data:
Using owl:sameAs and rdfs:seeAlso.
Using owl:equivalentClass and owl:equivalentProperty.
Following the rdfs:subClassOf or rdfs:subPropertyOf property.
Using skos:broader and skos:broadMatch.
Using skos:narrower and skos:narrowMatch.
Using skos:closeMatch, skos:mappingRelation and skos:exactMatch.
17. Exemplary expansion graph of the word "movie"
[Expansion graph nodes: movie, film, motion picture show, video, telefilm, home movie, production]
Saeedeh Shekarpour, Konrad Höffner, Jens Lehmann, Sören Auer: Keyword Query Expansion on Linked
Data Using Linguistic and Semantic Features. ICSC 2013: 191-197
18. How can KG promote NLP approaches?
• The types of entities recognized by NER tools are still limited to types
such as Person, Organization, Place, Date, Time.
• With the support of KG, NER tools can be schema-aware and
extended in order to:
Find new entities, e.g. names of drugs or animals
Remove case sensitivity from NER
Have schema-aware annotations, e.g.
President Barack Obama tweeted the American people in his final hours as head of state, promising to continue his
work with them and unveiling a new website.
[Annotation example: "Barack Obama" annotated with the schema-aware types Person and President, with properties such as Father and Spouse]
21. How can KG benefit IR approaches?
• Our search engines are no longer limited to keyword-based retrieval.
• Search engines are moving towards semantic retrieval & QA.
• KG enables template-based approaches.
23. Categorization based on the matter of information
Finding special characteristics of an instance
Finding similar instances
Finding associations between instances
Saeedeh Shekarpour, Sören Auer, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, Sebastian Hellmann, Claus Stadler: Keyword-Driven SPARQL Query Generation Leveraging Background Knowledge. Web Intelligence 2011: 203-210
24. Samples of keywords and results
Saeedeh Shekarpour, Sören Auer, Axel-Cyrille Ngonga Ngomo, Daniel Gerber, Sebastian Hellmann, Claus Stadler: Keyword-Driven SPARQL Query Generation Leveraging Background Knowledge. Web Intelligence 2011: 203-210
25. Outline
Introduction
Part 1: Vision
Advantages of using Knowledge Graph in
Question Answering
Machine Learning
NLP
Information Retrieval
Part 2: Research in depth
RQUERY: Rewriting Natural Language Queries on Knowledge Graphs
to Alleviate the Vocabulary Mismatch Problem
HeadEX: Triple Extraction from Stream of News Headlines on Twitter
using n-ary Relations
26. Input Query & Vocabulary Mismatch Problem
• It is likely that the input queries do not match the background
knowledge.
• Query expansion and query rewriting are solutions to this problem.
• But they risk yielding a large number of irrelevant words, which in
turn negatively influences runtime as well as accuracy.
[Figure: an input query of three keywords k1, k2, k3; expanding each keyword to 10 candidate words yields 10 × 10 × 10 possible rewrites.]
Saeedeh Shekarpour, Edgard Marx, Sören Auer, Amit Sheth: RQUERY: Rewriting Natural Language
Queries on Knowledge Graphs to Alleviate the Vocabulary Mismatch Problem. AAAI 2017
27. RQUERY Overview
I. Segment Generation: (1) Tokenization and stop word removal. (2) We generate all possible
segments which can be derived from q.
II. Segment Expansion: This module expands the segments derived from the previous module,
using the linguistic features of WordNet as a thesaurus: (1) synonyms, (2) hypernyms.
III. Derived Word Validation: Each derived word is validated against the background knowledge
base.
IV. Detecting and Ranking Possible Query Rewrites: We aim at distinguishing and ranking possible
query rewrites. We address the problem of finding the appropriate query rewrite by employing
a Hidden Markov Model (HMM) in three steps:
i. The state space is populated.
ii. Transitions between states are established.
iii. Parameters are bootstrapped.
[Architecture: input textual query → segment generation → segment expansion (using WordNet as an external resource) → derived word validation (against the RDF knowledge base) → detecting and ranking query rewrites (model construction) → ranked list of rewritten queries]
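Step I above (segment generation) can be sketched as enumerating all contiguous segments of the keyword tuple q:

```python
def segments(keywords):
    """All contiguous segments derivable from the keyword tuple q."""
    n = len(keywords)
    return [" ".join(keywords[i:j]) for i in range(n) for j in range(i + 1, n + 1)]

# 'band leader' can later be validated as one segment or as two.
print(segments(["profession", "band", "leader"]))
# → ['profession', 'profession band', 'profession band leader',
#    'band', 'band leader', 'leader']
```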
28. Example – Part 1
• Input Query: ‘What is the profession of bandleader?’
• Steps:
1) RQUERY derives and validates 10 words for the two given input keywords.
2) The state space is populated with all of these 10 validated words.
3) Then, all the transitions between states are recognized and established.
[HMM state space for the two observations: the keyword "profession" (Observation 1) can be emitted from the states occupation, profession, line, business, vocation, job; the keyword "bandleader" (Observation 2) from the states band, leader, director, music director, conductor; plus a Start state.]
29. Example – Part 2
4) Finally, we run the Viterbi algorithm, a dynamic programming approach for
finding the optimal path through an HMM. This algorithm discovers the most likely sequence of
states through which the sequence of input keywords is observable.
5) Thus, after running the Viterbi algorithm for the running query "profession of
bandleader", the top-6 generated outputs are as follows:
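A compact Viterbi sketch for the running example is below. The states are drawn from the slide above, but all probability values are hypothetical placeholders, not the bootstrapped parameters of RQUERY.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state sequence for a sequence of observed keywords."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s].get(observations[0], 0.0), [s])
            for s in states}
    for obs in observations[1:]:
        new = {}
        for s in states:
            # pick the predecessor state maximizing path * transition probability
            prob, prev = max(((best[p][0] * trans_p[p].get(s, 0.0), p)
                              for p in states), key=lambda x: x[0])
            new[s] = (prob * emit_p[s].get(obs, 0.0), best[prev][1] + [s])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

# Hypothetical numbers for the running query "profession bandleader":
states  = ["occupation", "job", "conductor", "music director"]
start_p = {"occupation": 0.3, "job": 0.3, "conductor": 0.2, "music director": 0.2}
trans_p = {s: {t: 0.25 for t in states} for s in states}
emit_p  = {"occupation":     {"profession": 0.9},
           "job":            {"profession": 0.6},
           "conductor":      {"bandleader": 0.8},
           "music director": {"bandleader": 0.5}}
print(viterbi(["profession", "bandleader"], states, start_p, trans_p, emit_p))
# → ['occupation', 'conductor']
```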
30. Methodology: Modeling by HMM
Formally, a HMM is a quintuple λ = (X, Y, A, B, π) where:
• X is a finite set of states. In our case, X equals the set of the validated derived words W. In other words, each word w ∈ W forms a state.
• Y denotes the set of observations. Here, Y equals the set of all segments seg ∈ S derived from the input n-tuple of keywords q.
• A : X × X → [0, 1] is the transition matrix. Each entry a_ij is the transition probability P(S_j | S_i) from state S_i to state S_j.
• B : X × Y → [0, 1] represents the emission matrix. Each entry b_i(seg) = P(seg | S_i) is the probability of emitting the segment seg from the state S_i.
• π : X → [0, 1] denotes the initial probability of states.
We define the basic problem as follows: the sequence of input keywords q and the model λ are given, and the problem is to find the optimal sequence of states qr = (S_1, S_2, ..., S_m) which explains the given observation, i.e. the input query q = (k_1, ..., k_n). Note that there are possibly multiple distinct sequences of states through which the given input query q is observable, thus the aim is obtaining the optimal one; formally: γ = argmax_qr P(qr | q, λ).
P(qr | q, λ) is the probability of observing the given query q through the sequence of states qr. For computing the probability of any query rewrite qr, the model λ plays the role of a constant parameter, thus we assume:
P(qr | q, λ) ≈ P(qr | q) ⟹ γ = argmax_qr P(qr | q)
31. Triples
• A triple has subject–predicate–object structure
• Jack knows Ann
[Figure: Jack —knows→ Ann, i.e. Subject —Predicate→ Object]
32. Triple-based Co-occurrence
For instance, the word "job" co-occurs with the word "profession", so the keyword "profession" is emitted from the state associated with the word "job".
Transitions between States: We define transitions between states based on the concept of co-occurrence of words. We adopt the concept of co-occurrence of words from the traditional information retrieval context and move it to RDF knowledge bases. Triple-based co-occurrence means co-occurrence of words in literals found in the resource descriptions of the two resources of a given triple.
Figure 3: The graph patterns employed for recognising co-occurrence of the two given words w1 and w2: (a) subject-predicate, (b) subject-object, (c) subject-literal, (d) predicate-object, (e) predicate-literal, (f) predicate-type of subject, (g) predicate-type of object. The letters s, p, o, c, l and a respectively stand for subject, predicate, object, class, rdfs:label and rdf:type.
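Pattern (b) of Figure 3 (subject-object co-occurrence) could be checked like this; the triple and labels below are a hypothetical toy knowledge base, not real RQUERY data:

```python
# Sketch of pattern (b) from Figure 3: w1 and w2 co-occur if they appear in
# the rdfs:label literals of the subject and the object of the same triple.
LABELS = {  # hypothetical rdfs:label values per resource
    "dbr:Bandleader": "band leader",
    "dbr:Conductor": "conductor",
    "dbo:occupation": "occupation",
}
KB_TRIPLES = [("dbr:Bandleader", "dbo:occupation", "dbr:Conductor")]

def cooccur_subject_object(w1, w2, triples, labels):
    """Triple-based co-occurrence via subject and object labels."""
    return any(w1 in labels.get(s, "") and w2 in labels.get(o, "")
               for s, _, o in triples)

print(cooccur_subject_object("leader", "conductor", KB_TRIPLES, LABELS))  # True
```

The other six patterns differ only in which pair of positions (subject, predicate, object, literal, type) the two labels are read from.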
33. Evaluation
Evaluation Criteria: The goal of our evaluation is investigating the positive as well as
negative impacts of the proposed approach by raising the following two questions:
① How effective is the approach at addressing the vocabulary mismatch problem when
employing queries that have a vocabulary mismatch problem?
② How effective is the approach at avoiding noise when employing queries
that do not have a vocabulary mismatch problem?
We employ Mean Reciprocal Rank (MRR).
Benchmark: we use an evaluation test collection for schema-agnostic query
mechanisms on RDF datasets (i.e. DBpedia) presented at ESWC 2015.
https://sites.google.com/site/eswcsaq2015/documents
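The MRR metric used above is simple to state in code; the ranks in the usage line are hypothetical:

```python
def mean_reciprocal_rank(ranks):
    """MRR over a set of queries. Each rank is the 1-based position of the
    first correct answer, or None when no correct answer is returned
    (such a query contributes 0 to the mean)."""
    return sum(1.0 / r for r in ranks if r) / len(ranks)

# Hypothetical ranks of the correct rewrite for four queries:
print(mean_reciprocal_rank([1, 2, None, 4]))  # (1 + 0.5 + 0 + 0.25) / 4 = 0.4375
```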
34. Evaluation
• Bootstrapping:
• Issue: We face a dynamic model: the state space as well as the observation
(i.e., the sequence of input keywords) vary from query to query. Thus, the learned
probability values should be generic and not query-dependent, because learning model
probabilities for each individual query is not feasible.
• Solution: We rely on bootstrapping, a technique used to estimate an unknown
probability distribution function. We apply three distributions (i.e., uniform, normal and
Zipfian) to find the most appropriate one.
[Bar chart: Mean Reciprocal Rank for the Uniform, Normal and Zipfian distributions over All Queries, Q1-10 and Q11-20; the reported values are 0.76, 0.51, 0.69, 0.85, 0.44, 0.82, 0.68, 0.58 and 0.63.]
35. Evaluation Results
[Two bar charts: Reciprocal Rank per query for HMM with Implicit Frequency, HMM with Explicit Frequency, and an n-gram language model. One chart covers queries that have a vocabulary mismatch problem (Q12, Q15, Q18, Q20, Q21, Q24, Q29, Q31, Q40, Q51, Q54, Q65, Q70, Q76, Q78, Q84); the other covers queries that do not (Q2, Q3, Q5, Q8, Q10, Q16, Q22, Q34, Q37, Q46, Q48, Q49, Q50, Q58, Q59, Q63, Q64, Q69, Q85, Q91, Q93).]
36. Outline
Introduction
Part 1: Vision
Advantages of using Knowledge Graph in
Question Answering
Machine Learning
NLP
Information Retrieval
Part 2: Research in depth
RQUERY: Rewriting Natural Language Queries on Knowledge Graphs
to Alleviate the Vocabulary Mismatch Problem
HeadEX: Triple Extraction from Stream of News Headlines on Twitter
using n-ary Relations
39. CEVO: Cognitive annotation on relations
• Problem:
Relation extraction
Contextual equivalence of relations
Diversity in conceptualization
• Requirements:
Relation tagging on textual data
Relation linking
Integration and alignment of properties
Simplicity
Reusability
40. CEVO: Cognitive annotation on relations
• CEVO is built upon Levin's categorization of
English verbs.
• CEVO has an abstract conceptualization.
• You can find CEVO at http://eventontology.org
41. Background Data Model
The Meet event is associated with entities of type Participant and Topic
(i.e., the topic discussed in the meeting). Considering the sample tweets, the
tweets no. 1, 4 and 7 are instances of the event Communication with the mentions
tell, say, announce. The tweets no. 2, 5 and 8 are instances of the event Meet
with the mentions meet, visit. The tweets no. 3, 6 and 9 are instances of the event
Murder with the mention kill.
[Fig. 1: Subclasses of the Generic Event. (a) Communication, Meet and Murder are subclasses of Generic Event, which has the properties publishedBy (Publisher), published date (xs:date), occurredIn (Location) and occurredOn (Time). (b) The Meet class with Participant (attendedIn) and Topic (about). (c) The Communication class with Giver (says), Addressee (addressed) and Message (expressed). (d) The Murder class with Killer (kills), Victim (killed), cause (caused, xs:string) and quantity (xs:integer).]
42. Example
Tweet #2: Instagram CEO meets with @Pontifex to discuss "the power
of images to unite people".
1. :Meet#1 a :Meet ; rdfs:label "meets" .
2. :e1 a :Participant ; rdfs:label "Instagram CEO" .
3. :e2 a :Participant ; rdfs:label "@Pontifex" .
4. :t1 a :Topic ;
   :body "to discuss the power of images to unite people" .
5. :e1 :attendedIn :Meet#1 .
6. :e2 :attendedIn :Meet#1 .
7. :Meet#1 :about :t1 .
8. :Meet#1 :publisher :CNN .
9. :Meet#1 :date "26/2/2016" .
44. Entity Extraction using Linguistic Analysis
[Fig. 2: Dependency tree for the running example "Instagram CEO meets with @Pontifex to discuss the power of images to unite people" (ROOT: meets; dependency labels include nsubj, xcomp, compound, case, mark, det, dobj, nmod, acl).]
Definition 3 (Dependent Chunk of ROOT). A Dependent Chunk of ROOT (DCR) is the
longest sequence of tokens of a given tweet that satisfies the following conditions: (i)
there is one token that is (directly) dependent on the root, and (ii) any other token
included in a given chunk is dependent on a token already within the given chunk.
Moreover, ROOT is an individual chunk.
Example 2 (Chunking a Tweet). We chunk the running example based on the concept
of the Dependent Chunk of ROOT (DCR). Figure 3 shows the resulting chunks. Except for
the chunk of the root (because the root is an individual chunk), any other chunk has only
one token that is dependent on the root (only one outgoing arrow to the root), and the
other tokens inside that chunk reference interior tokens (interior arrows). According to
this definition, the example tweet contains four individual chunks. For the chunk
"Instagram CEO", only the token "CEO" is dependent on the root, and the other token
"Instagram" is dependent on the interior token "CEO".
[Fig. 3: Chunking the running example into four chunks based on the Dependent Chunk of ROOT concept.]
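Definition 3 can be sketched as follows. The function assumes a pre-computed dependency parse given as (token, head-index) pairs, and the parse below is a simplified, hand-made fragment of the running example, not the output of a real parser:

```python
# Sketch of Definition 3: given a dependency parse as (token, head-index)
# pairs (head 0 = ROOT), each chunk is a token that depends directly on the
# root together with all of its descendants; the root is its own chunk.
def chunks(parse):
    tokens = [t for t, _ in parse]
    heads = [h for _, h in parse]
    root = heads.index(0) + 1            # 1-based index of the ROOT token

    def descendants(i):
        out = [i]
        for j, h in enumerate(heads, start=1):
            if h == i:
                out.extend(descendants(j))
        return out

    result = [tokens[root - 1]]          # ROOT is an individual chunk
    for i, h in enumerate(heads, start=1):
        if h == root:                    # token directly dependent on the ROOT
            idx = sorted(descendants(i))
            result.append(" ".join(tokens[k - 1] for k in idx))
    return result

# Simplified, hand-made parse of "Instagram CEO meets with @Pontifex":
parse = [("Instagram", 2), ("CEO", 3), ("meets", 0),
         ("with", 5), ("@Pontifex", 3)]
print(chunks(parse))  # → ['meets', 'Instagram CEO', 'with @Pontifex']
```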
48. Annotation Evolution
Metadata Annotation: PROV Ontology; Dublin Core Metadata
Linguistic Annotation: OLiA Ontologies; Linguistic Annotation Framework (LAF)
Interoperability Annotation: MEX (Machine Learning); QANARY (Question Answering); NLP Interchange Format (NIF)
Cognitive Annotation: CEVO (Comprehensive Event Ontology); Universal Conceptual Cognitive Annotation (UCCA)
49. CEVO use case 1: Annotating Text
BBC Tweet #1 on 10/3/2016:
Obama and Justin Trudeau announce efforts to fight climate change. (annotated as CEVO:Communication)
NYT Tweet #2 on 14/3/2016:
State elections were "difficult day," German Chancellor Angela Merkel says. (annotated as CEVO:Communication)
50. CEVO use case 2: Annotating Ontological Properties
We use the Web Annotation Data Model (WADM) for annotating
ontological properties:
example:annotation1 a oa:Annotation ;
  oa:hasTarget dbo:spouse ;
  oa:hasBody cevo:Amalgamate .
51. CEVO use case 3: Relation Linking
• Example: Rupert Murdoch and Jerry Hall marry.
<exam:headline#char=31,35> a nif:String ;
  nif:beginIndex 31 ;
  nif:endIndex 35 ;
  nif:anchorOf "marry" ;
  nif:oliaCategory Olia:MainVerb ;
  a cevo:Amalgamate .
example:annotation3 a oa:Annotation ;
  oa:hasTarget exam:headline#char=31,35 ;
  oa:hasBody dbo:spouse .
Speaker Notes
We encounter two issues. First, we need to find a set
of IRIs corresponding to each keyword. Second, we have to
construct suitable triple patterns based on the anchor points
extracted previously so as to retrieve appropriate data.
Figure 1 shows an overview of our approach. Our approach
first retrieves relevant IRIs related to each user-supplied keyword
from the underlying knowledge base, and then injects
them into a series of graph pattern templates for constructing
formal queries. To find these relevant IRIs, the following
two steps are carried out.
The categorization is based on the matter of information which is retrieved from the knowledge base.
Finding special characteristics of an instance: Datatype
properties which emanate from instances/classes to literals or
simple types, and also some kinds of object properties, state
characteristics of an entity and information around it. So, in
the simplest case of a query, a user intends to retrieve specific
information about an entity, such as "Population of Canada" or
"Language of Malaysia". Since this information is explicit,
the simple graph patterns IP.P1, IP.P4 and IP.P6 can be used
for retrieving this kind of information.
Finding similar instances: In this case, the user asks
for a list of instances which have a specific characteristic in
common. Examples of these types of queries are: "Germany
Island" or "Countries with English as official language". A
possible graph structure capturing potential answers for this
query type is depicted in Figure 4. It shows a set of instances
from the same class which have a certain property in common.
The graph pattern templates CI.P7, CI.P8 and CP.P14 retrieve this
kind of information.
Fig. 4. Similar instances with an instance in common.
Finding associations between instances: Associations between
instances in knowledge bases are defined as a sequence
of properties and instances connecting two given instances
(cf. Figure 5). Therefore, each association contains a set of
instances and the object properties connecting them, which is
the purpose of the user query. As an example, the query
"Volkswagen Porsche" can be used to find associations between
the two car makers. The graph pattern templates II.P9 and
II.P10 extract these associations.
One observation is that the hyponym relationship (words representing a specialization of the input word) leads to deriving a large number of terms, whereas their contribution to the vocabulary mismatch task is trivial.
In other words, we check the occurrence of each word w by sub-string matching against all literals (L) of the underlying RDF knowledge base. Then, if no occurrence is observed, the word w is simply removed from W.
We have to redefine traditional concepts from IR:
1. One of them is the concept of co-occurrence: in traditional IR, two terms co-occur when they appear in a specific window, paragraph or document.
2. The concept of frequency needs to be adapted.