1. Graphinder Semantic Search
Relational Keyword Search over Data Graphs
Thanh Tran, Lei Zhang, Veli Bicer, Yongtao Ma
Researcher: www.sites.google.com/site/kimducthanh
Co-Founder: www.graphinder.com
5. Semantic Search: use information about entities and
relationships explicitly given in structured data to provide
relevant answers for complex questions asked using
intuitive interfaces
“singles written by freddie, who is
member of the band queen”
“single written by freddie queen”
MusicBrainz
Single
Artist
Queen
Person
Queen
Elizabeth 1
<x, type, Single>
<Freddie Mercury, writer, x>
<Freddie Mercury, type, Artist>
<Freddie Mercury, member, Queen>
<Queen, type, Band>
DBpedia
Freddie
Mercury
Brian
May
writer
Liar
1971
single
<x, type, Single>
<x, writtenBy, Freddy>
Links
<Freddy, same-as, Freddy Mercury>
6. Entity Semantic Search: find relevant entity, return
structured data summary, facts, related entities
8. Semantic Search Problem: understand user inputs as
entities and relationships and find relevant answers
“single written by freddie queen”
“singles written by freddie, who is
member of the band queen”
Single
Artist
Queen
Freddie
Mercury
Brian
May
writer
Person
Queen
Elizabeth 1
Liar
1971
single
Query Translation: What are possible
connections (schema-level) between
recognized entities and relationships?
1)
<x, type, Single>
<Freddie Mercury, writer, x>
<Freddie Mercury, member, Queen>
2)
….
Query Answering: What are actual
connections (data-level) between
recognized entities and relationships?
1)
<Liar Liar, type, Single>
<Freddie Mercury, writer, Liar Liar>
<Freddie Mercury, member, Queen>
2)
…
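The two steps on this slide can be illustrated with a minimal sketch. It assumes a toy in-memory list of triples modelled on the running example (the `DATA` list, the `?x` variable syntax and the `match` function are all hypothetical, not part of Graphinder): the translated query is a list of triple patterns, and query answering binds its variables against the data.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Toy data graph (hypothetical, modelled on the slide's example).
DATA: List[Triple] = [
    ("Liar", "type", "Single"),
    ("Freddie Mercury", "writer", "Liar"),
    ("Freddie Mercury", "member", "Queen"),
    ("Brian May", "member", "Queen"),
    ("Queen", "type", "Band"),
]


def is_var(term: str) -> bool:
    """Variables are written with a leading question mark, e.g. ?x."""
    return term.startswith("?")


def match(patterns: List[Triple],
          binding: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in DATA:
        new_binding = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if is_var(term):
                # Bind the variable, or check it against an earlier binding.
                if new_binding.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, new_binding)


# Schema-level translation of "single written by freddie queen".
QUERY: List[Triple] = [
    ("?x", "type", "Single"),
    ("Freddie Mercury", "writer", "?x"),
    ("Freddie Mercury", "member", "Queen"),
]

answers = [b["?x"] for b in match(QUERY)]
print(answers)
```

A real system would evaluate such patterns with index-backed joins; the nested scan here only shows the relation between the translated query and its data-level answers.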
9. Relational Semantic Search at Facebook: recognizes entities and
relationships via LMs, uses manually specified template (grammar) to
find possible connections between them and computes answers via
resulting translated queries
“my friends, who is member of queen”
[start]
my friends, who is member of [id:Queen1]
friends(x,me), member(x,Queen1)
[user-head]
my friends
friends(x,me)
[user-filter]
who is member of [id:1]
member(x,Queen1)
[who]
who
-
[member-vp]
is member of [id:1]
member(x,Queen1)
[member-of-v]
is member of
member()
friends
member
{band}
[id:Queen1]
Queen1
queen
Grammar: set of production rules,
capturing all possible connections,
i.e. the search space of all parse trees
[start] → [users]
[users] → my friends
friends(x, me)
[…] → is member of [bands]
member(x, $1)
[bands] → {band}
$1
…
Grammar-based Query Translation:
which combination of production
rules results in a parse tree that
connects the recognized entities and
relationships?
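A rough sketch of template-based translation is given here. It is a flat pattern matcher, not a real parser building parse trees, and the rule set, the `ENTITIES` lexicon and the `translate` function are hypothetical illustrations rather than the grammar actually used at Facebook.

```python
import re
from typing import List, Optional

# Hypothetical entity lexicon: surface form -> (entity type, entity id).
ENTITIES = {"queen": ("band", "Queen1")}

# Hypothetical production rules: (surface pattern, semantic template).
# A named group such as (?P<band>...) plays the role of a typed slot.
RULES = [
    (re.compile(r"^my friends$"), "friends(x, me)"),
    (re.compile(r"^who is member of (?P<band>.+)$"), "member(x, {band})"),
]


def translate(query: str) -> Optional[List[str]]:
    """Translate a comma-separated query into a list of logical atoms."""
    atoms: List[str] = []
    for part in (p.strip() for p in query.lower().split(",")):
        for pattern, template in RULES:
            m = pattern.match(part)
            if not m:
                continue
            slots = {}
            for slot_type, surface in m.groupdict().items():
                entity = ENTITIES.get(surface)
                if entity is None or entity[0] != slot_type:
                    break  # slot cannot be filled, try the next rule
                slots[slot_type] = entity[1]
            else:
                atoms.append(template.format(**slots))
                break
        else:
            return None  # no rule covers this part of the query
    return atoms


print(translate("my friends, who is member of queen"))
```

The manual effort lies in writing the rules: every new relationship needs its own template, which is the cost a schema-driven translation tries to avoid.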
11. Graphinder Semantic Search: a translation-based approach
for relational keyword search over data graphs
Single
Artist
Person
Queen
Queen Elizabeth 1
Freddie Mercury
Brian
May
Liar
1971
single
writer
Sem. Auto-completion
Query Translation
- Entity + Relationships
- Multi-source
- Domain-independent
- Low manual effort
12. Graphinder: selected publications
• On-demand, domain-independent, relational keyword search
over data graphs
– Structure index for data graphs (TKDE13b)
– Top-k exploration of translation candidates (ICDE09)
– Index-based materialization of graphs (CIKM11a)
– Ranking results using structured relevance model (SRM) (CIKM11b)
• Multi-source
– Deduplication using inferred type information: TYPifier (ICDE13),
TYPimatch (WSDM13)
– On-the-fly deduplication using SRM (WWW11)
– Ranking with deduplication (ISWC13)
– Routing keyword queries to relevant data graphs (TKDE13a)
– Hermes: keyword search over heterogeneous data graphs
(SIGMOD09)
• Semantic auto-completion
– Computing valid query rewrites for given keywords (VLDB14)
14. 0) Query Translation: constructing pseudo schema graph
representing all possible connections between data elements
• Structure index for data graph: nodes are groups of data elements that share the same structure pattern
• Parameters: structure pattern with edge labels L and paths of maximum length n
• Pseudo schema
– Each node groups all instances that have the same set of properties
– Structure pattern: all properties, i.e. all outgoing paths with n = 1, L = all edge labels
• Algorithm:
– Start with a single partition/node representing all instances
– Split until all nodes are “stable”, i.e., all contained instances share the same structure pattern
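For the pseudo schema case (n = 1, L = all edge labels), the stable partition can be sketched by grouping instances directly by their set of outgoing edge labels, instead of running the iterative split. The triples and the `pseudo_schema` function below are hypothetical; the instance ids `a1`, `s1`, `b1` stand in for artists, singles and a band.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy data graph.
TRIPLES: List[Triple] = [
    ("a1", "member", "b1"),
    ("a1", "writer", "s1"),
    ("a2", "member", "b1"),
    ("a2", "writer", "s2"),
    ("s1", "year", "1971"),
    ("s2", "year", "1975"),
]


def pseudo_schema(triples: List[Triple]):
    """Group nodes by their set of outgoing edge labels (paths of length 1)."""
    properties: Dict[str, Set[str]] = {}
    for s, p, o in triples:
        properties.setdefault(s, set()).add(p)
        properties.setdefault(o, set())  # nodes without outgoing edges

    group_of: Dict[str, FrozenSet[str]] = {
        node: frozenset(props) for node, props in properties.items()
    }
    groups: Dict[FrozenSet[str], Set[str]] = defaultdict(set)
    for node, key in group_of.items():
        groups[key].add(node)

    # A pseudo schema edge connects two groups if any of their members are connected.
    edges = {(group_of[s], p, group_of[o]) for s, p, o in triples}
    return dict(groups), edges


groups, edges = pseudo_schema(TRIPLES)
for key, members in groups.items():
    print(sorted(key), sorted(members))
```

For longer paths (n > 1) a direct grouping like this is no longer enough, and the split-until-stable refinement described above is needed.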
Single
Artist
Queen
Freddie
Mercury
Brian
May
Person
Queen
Elizabeth 1
Liar
single
writer
member
Artist
producer
Thing12
writer
Single
marital status
Person
Value2
15. 1) Query Translation: constructing search space
representing all possible interpretations of query keywords
“written by freddie queen single”
Freddie
Mercury
Queen
Elizabeth 1
Artist
Freddie
Mercury
producer
Band
Queen
Data
Index
single
writer
member
Queen
Single
Single
Schema
Index
marital status
writer
Keyword Interpretation: use inverted
index and LM-based ranking function to
return relevant schema and data
elements
Person
Literal
Queen
Elizabeth 1
single
Search Space Construction: augment
pseudo schema with query-specific
keyword matching elements
• All possible connections of predicates
applicable to recognized query
keywords
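Keyword interpretation can be sketched with a small inverted index over element labels. The element list, `build_index` and `interpret` are hypothetical, and the score used here (the fraction of the label covered by the keyword) is only a stand-in for the LM-based ranking function named on the slide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical schema and data elements from the running example: (label, kind).
ELEMENTS: List[Tuple[str, str]] = [
    ("Freddie Mercury", "instance"),
    ("Queen", "instance"),
    ("Queen Elizabeth 1", "instance"),
    ("single", "literal"),
    ("Single", "class"),
    ("writer", "edge"),
]


def build_index(elements: List[Tuple[str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    """Map every lower-cased label token to the elements mentioning it."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for label, kind in elements:
        for token in set(label.lower().split()):
            index[token].append((label, kind))
    return index


def interpret(index: Dict[str, List[Tuple[str, str]]],
              keyword: str) -> List[Tuple[str, str, float]]:
    """Return matching elements, best first, with a simple coverage score."""
    matches = index.get(keyword.lower(), [])
    scored = [(label, kind, 1.0 / len(label.split())) for label, kind in matches]
    return sorted(scored, key=lambda m: -m[2])


INDEX = build_index(ELEMENTS)
print(interpret(INDEX, "queen"))
print(interpret(INDEX, "single"))
```

The keyword "single" matches both the class Single and the literal "single" (marital status), which is exactly the ambiguity the search space has to carry forward.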
Top-k Subgraph Exploration
Result Retrieval & Ranking
16. 2) Query Translation: score-directed algorithm for finding
top-k subgraphs connecting keyword matching elements
“written by freddie queen single”
member
Artist
Freddie
Mercury
producer
Band
Queen
marital status
writer
Single
Person
Literal
Queen
Elizabeth 1
single
<x, type, Single>
<Queen, producer, x>
<Freddie Mercury, writer, x>
<Queen, type, Band>
<Freddie Mercury, type, Artist>
Algorithm: score-directed top-k Steiner graph search
• Start: explore all distinct paths starting from keyword elements
• Every iteration:
– One-step expansion of the current path with the highest score
– When a connecting element is found, merge paths and add the resulting graph to the candidate list
• Top-k termination: the lowest score in the candidate list > the highest possible score that can be achieved with paths in the queues yet to be explored
• Termination: all paths of maximum length d have been explored
• Final step: mapping rules to translate the Steiner graph into a structured query
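The exploration can be sketched as follows, under simplifying assumptions that are not from the slides: the search space is an undirected graph with non-negative edge costs (lower cost plays the role of higher score), only the cheapest path per keyword is kept at each node, and the toy graph and the function `top_k_subgraphs` are hypothetical.

```python
import heapq
from typing import Dict, FrozenSet, List, Tuple

Graph = Dict[str, Dict[str, float]]


def make_graph(edge_list: List[Tuple[str, str]]) -> Graph:
    """Build a symmetric adjacency map with unit edge costs."""
    graph: Graph = {}
    for a, b in edge_list:
        graph.setdefault(a, {})[b] = 1.0
        graph.setdefault(b, {})[a] = 1.0
    return graph


def top_k_subgraphs(graph: Graph, keyword_nodes: List[List[str]], k: int = 1, d: int = 3):
    """Cost-ordered path expansion from all keyword elements at once."""
    queue = [(0.0, i, (n,)) for i, nodes in enumerate(keyword_nodes) for n in nodes]
    heapq.heapify(queue)
    reached: Dict[str, Dict[int, Tuple[float, Tuple[str, ...]]]] = {}
    candidates: Dict[FrozenSet[FrozenSet[str]], float] = {}
    while queue:
        cost, i, path = heapq.heappop(queue)
        best = sorted(candidates.values())
        # Top-k termination: unexplored paths cost at least `cost`,
        # so no new candidate can be cheaper than the current k-th best.
        if len(best) >= k and best[k - 1] <= cost:
            break
        node = path[-1]
        arrivals = reached.setdefault(node, {})
        if i in arrivals:
            continue  # a cheaper path for this keyword already got here
        arrivals[i] = (cost, path)
        if len(arrivals) == len(keyword_nodes):
            # Connecting element: merge the paths of all keywords into one subgraph.
            edges = frozenset(
                frozenset(pair) for _, p in arrivals.values() for pair in zip(p, p[1:])
            )
            total = sum(c for c, _ in arrivals.values())
            candidates[edges] = min(total, candidates.get(edges, total))
        if len(path) - 1 < d:  # bound on the maximum path length
            for nxt, weight in graph[node].items():
                if nxt not in path:
                    heapq.heappush(queue, (cost + weight, i, path + (nxt,)))
    return sorted(candidates.items(), key=lambda kv: kv[1])[:k]


GRAPH = make_graph([
    ("Freddie Mercury", "Artist"), ("Artist", "Single"), ("Artist", "Band"),
    ("Band", "Single"), ("Queen", "Band"),
    ("Queen Elizabeth 1", "Person"), ("Person", "single (literal)"),
])
KEYWORDS = [["Freddie Mercury"], ["Queen", "Queen Elizabeth 1"], ["Single", "single (literal)"]]
results = top_k_subgraphs(GRAPH, KEYWORDS, k=1, d=3)
print(results)
```

In this toy search space the Person branch never becomes a candidate, because no path from the "freddie" element reaches it.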
18. Ranking Using Structured LMs: Keyword query is short and
ambiguous, while structured data provide rich structure
information: ranking based on LMs capturing both content and
structure
• Structured LMs for structured results r: $RM_r(v) = P(v \mid r)$
• Structured LM for queries using structured pseudo-relevant feedback results $F_R$ (relevance model): $RM_{F_R}(v) = P(v \mid F_R)$
• Compute distance between query and result LMs: $Score(r) = \sum_{v \in V} RM_{F_R}(v) \log RM_r(v)$
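As a small numeric illustration, the score on this slide can be read as the sum over vocabulary terms v of RM_FR(v) · log RM_r(v). The sketch below assumes both models are plain dictionaries from term to probability; the `epsilon` floor for unseen terms and the toy numbers are assumptions, not values from the talk.

```python
import math
from typing import Dict


def score(rm_feedback: Dict[str, float], rm_result: Dict[str, float],
          epsilon: float = 1e-9) -> float:
    """Sum of RM_FR(v) * log RM_r(v); higher means the result is closer to the query model."""
    return sum(
        p * math.log(rm_result.get(v, epsilon))  # epsilon avoids log(0) for unseen terms
        for v, p in rm_feedback.items()
    )


# Hypothetical query model built from feedback results.
query_model = {"mercury": 0.5, "brian": 0.5}

# Two hypothetical candidate result models.
close_result = {"mercury": 0.5, "brian": 0.5}
far_result = {"mercury": 0.9, "brian": 0.1}

print(score(query_model, close_result))
print(score(query_model, far_result))
```

The candidate whose term distribution matches the query model more closely receives the higher score.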
19. Relevance Models
[Figure: query “freddie queen” and feedback documents F; terms shown: Mercury, Brian May, Protest, Raid, Clash, Bank, West]
• Term probabilities of the query model are based on documents
• Ranking behaves like similarity search between pseudo-relevant feedback documents and corpus documents
[Figure: candidate documents; terms shown: Mercury, Brian May, Protest, Raid, Clash, Bank, West]
20. Structured Relevance Models
[Figure: query “queen single” over structured data and feedback results F; terms shown: Mercury, Brian May, Protest, Raid, Clash, Bank, West]
• Term probabilities of the query model are based on pseudo-relevant structured data
• Ranking behaves like similarity search between pseudo-relevant structured results and structured result candidates
[Figure: candidate results over structured data; terms shown: Mercury, Brian May, Protest, Raid, Clash, Bank, West]
21. Ranking: construct edge-specific query model for each unique e
from feedback resources FR, edge-specific model for every
candidate r, and finally, compute distance
• For all resources r in FR
• Probability of observing term v in the value of property e of resource r
• Importance of resource r w.r.t. the query

v          RMname   RMcomment   RMx
Mercury    .091     .01         …
Brian      .082     .01         …
Champion   .081     .02         …
Protest    .001     .042        …
Raid       .006     .014        …
…          …        …           …

v          RMname   RMcomment   RMx
Mercury    .073     .01         …
Brian      .052     .01         …
…          …        …           …
23. Query Rewriting: find syntactically and semantically valid
rewrites to suggest as user types
single from freddy mercury que
Freddie
Mercury
Queen
Elizabeth 1
Queen
single
writer
Single
Data
Index
Schema
Index
Benefits:
- Higher selectivity of query terms (quality)
- Reduced number of query terms (efficiency)
- Better search experience…
Freddie
Mercury
Data
Index
Queen
writer
Single
Schema
Index
Challenges: many rewrite candidates, some are
semantically not “valid” in the relational setting
single (marital status) writer “freddie mercury” queen
(the queen of UK)
Token rewriting via syntactic distance
Keyword Interpretation:
- Imprecise / fuzzy matching
1) single from freddie mercury queen
- Match every keyword
…
Token rewriting via semantic distance
1) single writer freddie mercury queen
…
Query segmentation
1) single writer “freddie mercury” queen
…
Keyword / Key Phrase Interpretation:
- Precise matching
- Match keyword and key phrases
Search Space Construction
Search Space Construction
Result Retrieval & Ranking
24. Probabilistic Model for Query Rewriting: the rank of a
query rewrite (suggestion) S is based on the
probability of observing S in the data, given the query
Based on Bayes' Theorem
Probability that users write spelling errors / semantically related queries is independent of data D
single writer freddy mercury que
1) single writer freddie mercury queen
2) single writer freddrick mercury monarch
3) song writer freddrick mercury head of state
Constant
given query Q
and data D
Single
Artist
Person
Queen
Queen Elizabeth 1
Token Rewriting: S is
ranked high when prob
that query Q can be
observed in S is high
Query Segmentation: S is
ranked high when prob that
S can be observed in the
data D is high
Freddie Mercury
Brian
May
Liar
writer
1971
single
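The annotations on this slide suggest the following decomposition. It is a reconstruction from those annotations and from the P(Q|S) and P(S|D) terms named on the following slides, not a formula copied from the talk.

```latex
P(S \mid Q, D)
  = \frac{P(Q \mid S, D)\, P(S \mid D)}{P(Q \mid D)}
  \;\propto\; P(Q \mid S)\, P(S \mid D)
```

Here P(Q | S, D) reduces to P(Q | S) because how users misspell or paraphrase is taken to be independent of the data D, and P(Q | D) is dropped because it is constant for a given query Q and data D. The first factor is the token rewriting model, the second the query segmentation model.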
25. Token Rewriting
• Modeling token rewriting P(Q|S)
Split: |
Concatenate: +
• Independence assumption
• Modeling syntactic and semantic differences
P(q|t): is high when q is
syntactically and
semantically close to t
single writer freddy mercury que
1) single writer “freddie mercury” queen
2) single writer “freddrick mercury” monarch
3) single writer “freddrick mercury” head of state
single | writer | freddie + mercury | queen
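A sketch of the token rewriting model under the independence assumption: P(Q|S) is taken as the product of per-token terms P(q|t). The per-token term used here, exp(-edit distance), is an unnormalized stand-in that only captures syntactic closeness; the semantic component and the actual estimate from the talk are not reproduced, and the function names are hypothetical.

```python
import math
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def token_score(q: str, t: str) -> float:
    """Unnormalized stand-in for P(q|t): high when q is syntactically close to t."""
    return math.exp(-edit_distance(q, t))


def rewrite_score(query: List[str], suggestion: List[str]) -> float:
    """Stand-in for P(Q|S) as a product of independent per-token terms."""
    assert len(query) == len(suggestion)
    return math.prod(token_score(q, t) for q, t in zip(query, suggestion))


query = "single writer freddy mercury que".split()
s1 = "single writer freddie mercury queen".split()
s2 = "single writer freddrick mercury monarch".split()
print(rewrite_score(query, s1), rewrite_score(query, s2))
```

Multi-token suggestions such as "head of state" would need the split and concatenate operators from this slide, which this one-to-one alignment does not model.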
26. Query Segmentation
• Modeling query segmentation P(S|D)
single writer freddie mercury que
α = concatenate?
α = split?
where $P_D(\alpha_i t_{i+1} \mid t_1 \alpha_1 t_2 \ldots \alpha_{i-1} t_i)$
stands for $P(\alpha_i t_{i+1} \mid t_1 \alpha_1 t_2 \ldots \alpha_{i-1} t_i, D)$.
• Nth order Markov assumption
[Figure: data graph with labels Single, Artist, Person, Queen, Queen Elizabeth 1, Freddie Mercury, Brian May, Liar, 1971, writer, single; query prefix “single writer freddie”]
27. Estimating Probability of Segmentation
• Maximum likelihood estimation (MLE)
where $C(t_i \ldots t_j)$ denotes the count of occurrences of the token sequence $t_i \ldots t_j$
Segmentation in structured data setting
• Concatenate two segments $s_i$ and $s_j$ when they co-occur in the data
• Split when $s_i$ and $s_j$ are connected ($s_i$ ↭ $s_j$), i.e., when the two data elements $n_i$ and $n_j$ mentioning $s_i$ and $s_j$ are connected in the data
single writer freddie mercury queen
Single
Artist
α = concatenate?
α = split?
single writer freddie
Person
Queen
Freddie
Mercury
Brian
May
writer
Queen
Elizabeth 1
Liar
1971
single
28. Estimating Probability of Segmentation Case 1: previous segment $s_i$ has length equal to or greater than context N
• Two cases: (1) $l(s_i) \geq N$; (2) $l(s_i) < N$
• (1) When the previously induced segment $s_i$ has length equal to or greater than N, i.e. $l(s_i) \geq N$, it suffices to focus on $s_i(N)$ to predict the next action $\alpha_i$ on $t_{i+1}$
freddie j. mercury
queen
freddie j. mercury
queen
• Estimation of probability
where C(st) denotes the count of co-occurrences of the sequence st in D and
C(s ↭ t) is the count of all occurrences of token t connected to segment s
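The two counts can be illustrated on a toy data graph. The estimation formula itself is not legible in this transcript, so the normalization below (each count divided by the sum of both) is an assumption, as are the element labels, the edge list and all function names.

```python
from typing import Dict, List, Tuple

# Hypothetical data elements (id -> label) and connections between them.
ELEMENTS: Dict[str, str] = {
    "n1": "freddie mercury",
    "n2": "queen",
    "n3": "brian may",
    "n4": "queen elizabeth 1",
}
EDGES: List[Tuple[str, str]] = [("n1", "n2"), ("n3", "n2")]


def contains(label: str, seq: List[str]) -> bool:
    """True if the token sequence occurs contiguously in the label."""
    tokens = label.split()
    return any(tokens[i:i + len(seq)] == seq for i in range(len(tokens) - len(seq) + 1))


def count_sequence(s: List[str], t: str) -> int:
    """C(st): elements in which segment s is directly followed by token t."""
    return sum(contains(label, s + [t]) for label in ELEMENTS.values())


def count_connected(s: List[str], t: str) -> int:
    """C(s ↭ t): connected element pairs where one mentions s and the other mentions t."""
    total = 0
    for a, b in EDGES:
        for x, y in ((a, b), (b, a)):
            if contains(ELEMENTS[x], s) and contains(ELEMENTS[y], [t]):
                total += 1
    return total


def action_probs(s: List[str], t: str) -> Dict[str, float]:
    """Assumed normalization: each action's count over the sum of both counts."""
    c_seq, c_conn = count_sequence(s, t), count_connected(s, t)
    total = c_seq + c_conn
    if total == 0:
        return {"concatenate": 0.0, "split": 0.0}
    return {"concatenate": c_seq / total, "split": c_conn / total}


print(action_probs(["freddie"], "mercury"))
print(action_probs(["freddie", "mercury"], "queen"))
```

On this toy data "freddie" followed by "mercury" favours concatenation, while "freddie mercury" followed by "queen" favours a split, because the two are separate but connected elements.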
29. Estimating Probability of Segmentation Case 2: previous segment $s_i$ has length less than context N
• (2) When the previous segment $s_i$ has length less than N, i.e. $l(s_i) < N$, the action $\alpha_i$ on the next token $t_{i+1}$ depends on $s_i$ and $P_i(N)$, the set of segments that precede $s_i$ and that, together with $s_i$, contain at most N tokens in total, i.e.,
single
writer
freddie
mercury single
writer
freddie
mercury
• Estimation of probability
where C(P ↭ s) denotes the count of all occurrences of the segment s
connected to all segments in P
31. • Graphinder, a relational keyword search approach for suggesting query completions, translating queries and ranking results
• Keyword translation performance
– Query translation and index-based approaches are at least one order of magnitude faster than online in-memory search (bidirectional)
– Query translation is comparable with index-based approaches, but requires less space
• Keyword translation result quality
– According to a recent benchmark, our ranking consistently outperforms all existing ranking systems in precision, recall and MAP (10% - 30% improvement)
• Effect of query rewriting
– Better user experience
– Improves efficiency by reducing the number of query terms
– Improves quality / selectivity of query terms
– …depends on complexity of queries and underlying keyword search engine
• Tight integration of query suggestion and translation
• From research prototypes to Graphinder, a powerful, flexible, low upfront-cost semantic search system
33. References (1)
– [VLDB14] Yongtao Ma, Thanh Tran
Probabilistic Query Rewriting for Efficient and Effective Keyword Search on
Graph Data
In International Conference on Very Large Data Bases (VLDB'14). Hangzhou,
China, September, 2014
– [ISWC13] Daniel Herzig, Roi Blanco, Peter Mika and Thanh Tran
Federated Entity Search Using On-the-Fly Consolidation
In International Semantic Web Conference (ISWC'13). Sydney, Australia, October,
2013
– [ICDE13] Yongtao Ma, Thanh Tran
TYPifier: Inferring the Type Semantics of Structured Data
In International Conference on Data Engineering (ICDE'13). Brisbane, Australia, April,
2013
– [WSDM13] Yongtao Ma, Thanh Tran
TYPiMatch: Type-specific Unsupervised Learning of Keys and Key Values for
Heterogeneous Web Data Integration
In International Conference on Web Search and Data Mining (WSDM'13). Rome,
Italy, February, 2013
– [TKDE12a] Thanh Tran, Günter Ladwig, Sebastian Rudolph
Managing Structured and Semi-structured RDF Data Using Structure Indexes
In Transactions on Knowledge and Data Engineering journal.
– [TKDE12b] Thanh Tran, Lei Zhang
Keyword Query Routing
In Transactions on Knowledge and Data Engineering journal.
34. References (2)
– [WWW12] Daniel Herzig, Thanh Tran
Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration
In Proceedings of 21st International World Wide Web Conference (WWW'12). Lyon,
France, April, 2012
– [CIKM11a] Günter Ladwig, Thanh Tran
Index Structures and Top-k Join Algorithms for Native Keyword Search Databases
In Proceedings of 20th ACM Conference on Information and Knowledge
Management (CIKM'11). Glasgow, UK, October, 2011
– [CIKM11b] Veli Bicer, Thanh Tran
Ranking Support for Keyword Search on Structured Data using Relevance Models
In Proceedings of 20th ACM Conference on Information and Knowledge
Management (CIKM'11). Glasgow, UK, October, 2011
– [SIGIR11] Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey
Pound, Henry S. Thompson, Thanh Tran Duc
Repeatable and Reliable Search System Evaluation using Crowdsourcing
In Proceedings of 34th Annual International ACM SIGIR Conference (SIGIR'11),
Beijing, China, July, 2011
– [ICDE09] Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano
Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF
In Proceedings of the 25th International Conference on Data Engineering (ICDE'09).
Shanghai, China, March 2009
– [SIGMOD09] Haofen Wang, Thomas Penin, Kaifeng Xu, Junquan Chen, Xinruo Sun,
Linyun Fu, Yong Yu, Thanh Tran, Peter Haase, Rudi Studer
Hermes: A Travel through Semantics in the Data Web
In Proceedings of SIGMOD Conference 2009. Providence, USA, June-July, 2009
Construct query model from structured data elements that are close to the query.
Index resources in the data graph, where resources are treated as documents, and attributes and attribute values are indexed as document terms; use a standard inverted index implementation and IR search engine to retrieve resources for a given keyword query; an initial run of the query yields the F results.
Query model: the probability of terms in the query model is estimated using the F resources. Intuitively, the probability of a term is estimated as the probability of observing this term in the F resources (based on the probability of observing the term in the e-value of r, and the probability of e). Weight by the importance of that resource: a resource is more important if query terms are more likely to be observed in that resource, compared to other resources in F.
Edge-specific resource model: probability of observing term v in the e-value of r, smoothed with the probability of observing term v in all values of r.
The score of a resource is calculated based on the cross-entropy of the edge-specific RM and the edge-specific ResM, aggregated over every e. Alpha allows controlling the importance of edges.
Instead of single entities, ranking complex graphs comprising multiple entities, called Joined Result Tuples (JRTs): model complex results as a geometric mean of the entity models.
Ranking aggregated JRTs: the cross-entropy between the edge-specific RM (query model) and the geometric mean of the combined edge-specific ResM.
The proposed ranking function is monotonic with respect to the individual resource scores (a necessary property for using top-k algorithms).
A language model is constructed for every attribute of the resource to capture the probability of a word being observed via repeated sampling from the content of a specific attribute of r.
Lambda controls the weight of the edge-specific attribute; a small value means less emphasis on the terms of the attribute and more emphasis on the terms of the entire resource (terms in all attributes).
Pe is the probability of observing a word v in the edge-specific attribute a. P* is the probability of observing a word v in all attributes of r.
Consider the co-occurrences of a word and query words in the content of a specific attribute a.
The sampling process we implement is i.i.d. sampling: query words and w are i.i.d. sampled from a unigram distribution a, i.e. representing content of the specific attribute a, then sample v from a, and then sample k times query words from a distribution representing the content of all attributes of r
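The smoothing and the geometric mean described in these notes can be sketched as follows. The linear interpolation with weight lambda is an assumed reading of "smoothing" and "Lambda controls the weight of the edge-specific attribute"; the resource, its attribute names and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def term_prob(tokens: List[str], v: str) -> float:
    """Maximum likelihood probability of term v in a token list."""
    return Counter(tokens)[v] / len(tokens) if tokens else 0.0


def resource_model(resource: Dict[str, str], edge: str, v: str, lam: float = 0.5) -> float:
    """Assumed smoothing: lam * P_e(v) + (1 - lam) * P*(v)."""
    edge_tokens = resource[edge].split()                                       # value of attribute e
    all_tokens = [t for value in resource.values() for t in value.split()]    # all attributes of r
    return lam * term_prob(edge_tokens, v) + (1.0 - lam) * term_prob(all_tokens, v)


def geometric_mean(probs: List[float]) -> float:
    """Combine the per-entity probabilities of a joined result tuple."""
    return math.prod(probs) ** (1.0 / len(probs))


# Hypothetical resource with two attributes.
resource = {"name": "freddie mercury", "comment": "singer of queen"}

print(resource_model(resource, "name", "freddie"))  # term occurs in the edge-specific value
print(resource_model(resource, "name", "queen"))    # term occurs only elsewhere in the resource
print(geometric_mean([0.25, 1.0]))
```

With a small lambda the second case gains weight, which matches the note that a small value puts more emphasis on the terms of the entire resource.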