Graphinder semantic search

Graphinder Semantic Search
Relational Keyword Search over Data Graphs
Thanh Tran, Lei Zhang, Veli Bicer, Yongtao Ma
Researcher: www.sites.google.com/site/kimducthanh
Co-Founder: www.graphinder.com

Agenda
• Introduction
• Graphinder: Overview
• Keyword Query Translation
• Keyword Query Result Ranking
• Keyword Query Rewriting
– Suggesting correct and meaningful queries
– Auto-complete as user types

Motivation: lots of structured data

Semantic Search: use information about entities and
relationships explicitly given in structured data to provide
relevant answers for complex questions asked using
intuitive interfaces
“singles written by freddie, who is
member of the band queen”
“single written by freddie queen”

MusicBrainz
Single

Artist
Queen

Person

Queen
Elizabeth 1

<x, type, Single>
<Freddie Mercury, writer, x>
<Freddie Mercury, type, Artist>
<Freddie Mercury, member, Queen>
<Queen, type, Band>

DBpedia
Freddie
Mercury

Brian
May
writer

Liar

1971

single

<x, type, Single>
<x, writtenBy, Freddy>

Links

<Freddy, same-as, Freddie Mercury>

Entity Semantic Search: find relevant entity, return
structured data summary, facts, related entities

Relational Semantic Search: find relevant entities
involved in a relationship, return entity summaries…

Semantic Search Problem: understand user inputs as
entities and relationships and find relevant answers

“single written by freddie queen”
“singles written by freddie, who is
member of the band queen”
Single

Artist

Queen

Freddie
Mercury

Brian
May
writer

Person

Queen
Elizabeth 1

Liar

1971

single

Query Translation: What are possible
connections (schema-level) between
recognized entities and relationships?
1)
<x, type, Single>
2)
….
Query Answering: What are actual
connections (data-level) between
recognized entities and relationships?
1)
<Liar Liar, type, Single>
<Freddie Mercury, writer, Liar Liar>
2)
…
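
A minimal sketch of the query answering step, assuming the data graph is held in memory as a set of (subject, predicate, object) triples. The helper names (`match`, `answer`) and the `?x` variable syntax are illustrative assumptions, not part of Graphinder; the join is a plain nested loop rather than an indexed one.

```python
# Toy data graph; triples follow the examples on the slides.
TRIPLES = {
    ("Liar Liar", "type", "Single"),
    ("Freddie Mercury", "writer", "Liar Liar"),
    ("Freddie Mercury", "type", "Artist"),
    ("Freddie Mercury", "member", "Queen"),
    ("Queen", "type", "Band"),
    ("Brian May", "type", "Artist"),
}


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`; return None on conflict."""
    new = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):            # variable: bind it or check the old binding
            if new.setdefault(p, t) != t:
                return None
        elif p != t:                     # constant: must match exactly
            return None
    return new


def answer(patterns, triples=TRIPLES):
    """Evaluate a conjunctive query by nested-loop joins over all triples."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := match(pattern, t, b)) is not None]
    return bindings


# Translated query: <x, type, Single>, <Freddie Mercury, writer, x>
query = [("?x", "type", "Single"), ("Freddie Mercury", "writer", "?x")]
results = answer(query)
```

Here `results` holds one binding per data-level connection that satisfies every pattern of the translated query.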

Relational Semantic Search at Facebook: recognizes entities and
relationships via LMs, uses manually specified template (grammar) to
find possible connections between them and computes answers via
resulting translated queries
“my friends, who is member of queen”
[start]
my friends, who is member of [id:Queen1]
friends(x,me), member(x,Queen1)
[user-head]
my friends
friends(x,me)

[user-filter]
who is member of [id:1]
member(x,Queen1)
[who]
who
-

[member-vp]
is member of [id:1]
member(x,Queen1)
[member-of-v]
is member of
member()

friends

member

{band}
[id:Queen1]
Queen1

queen

Grammar: set of production rules,
capturing all possible connections,
i.e. the search space of all parse trees
[start] → [users]
[users] → my friends                friends(x, me)
[…] → is member of [bands]          member(x, $1)
[bands] → {band}                    $1
…
Grammar-based Query Translation:
which combination of production
rules results in a parse tree that
connects the recognized entities and
relationships?

Graphinder Semantic Search: a translation-based approach
for relational keyword search over data graphs

Single

Artist

Person

Queen

Queen Elizabeth 1

Freddie Mercury

Brian
May

Liar

1971

single

writer

Sem. Auto-completion

Query Translation
- Entity + Relationships
- Multi-source
- Domain-independent
- Low manual effort

Graphinder: selected publications
• On-demand, domain-independent, relational keyword search
over data graphs
– Structure index for data graphs (TKDE12a)
– Top-k exploration of translation candidates (ICDE09)
– Index-based materialization of graphs (CIKM11a)
– Ranking results using structured relevance model (SRM) (CIKM11b)

• Multi-source
– Deduplication using inferred type information: TYPifier (ICDE13),
TYPimatch (WSDM13)
– On-the-fly deduplication using SRM (WWW12)
– Ranking with deduplication (ISWC13)
– Routing keyword queries to relevant data graphs (TKDE12b)
– Hermes: keyword search over heterogeneous data graphs
(SIGMOD09)

• Semantic auto-completion
– Computing valid query rewrites for given keywords (VLDB14)

0) Query Translation: constructing pseudo schema graph
representing all possible connections between data elements
• Structure index for data graph: nodes are groups of data elements
that share the same structure pattern
• Parameters: structure pattern with edge labels L and paths of maximum
length n
• Pseudo schema
– Node groups all instances that have the same set of properties
– Structure pattern: all properties, i.e. all outgoing paths with
n = 1, L = all edge labels
• Algorithm:
– Start with one single partition/node representing all instances
– Split until all nodes are “stable”, i.e., all contained instances
share the same structure pattern
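
A minimal sketch of the pseudo schema for the special case n = 1, L = all edge labels. For this case the stable partition can be obtained directly by grouping instances on their set of outgoing edge labels, so the sketch skips the iterative splitting described above. The function name `pseudo_schema` and the toy edges are assumptions made for illustration.

```python
from collections import defaultdict

# Toy data graph edges (subject, label, object), loosely following the slides.
EDGES = [
    ("Freddie Mercury", "member", "Queen"),
    ("Freddie Mercury", "writer", "Liar"),
    ("Brian May", "member", "Queen"),
    ("Brian May", "writer", "Liar"),
    ("Queen", "producer", "Liar"),
    ("Queen Elizabeth 1", "marital status", "single"),
]


def pseudo_schema(edges):
    """Group instances that share the same set of outgoing edge labels."""
    props = defaultdict(set)
    for s, label, o in edges:
        props[s].add(label)
        props.setdefault(o, set())   # nodes without outgoing edges get an empty pattern
    groups = defaultdict(set)
    for node, labels in props.items():
        groups[frozenset(labels)].add(node)
    return dict(groups)


schema = pseudo_schema(EDGES)
```

Each key of `schema` is one structure pattern and each value is the node group that becomes a single pseudo schema node.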

Single

Artist
Queen

Freddie
Mercury

Brian
May

Person

Queen
Elizabeth 1

Liar

single

writer

member

Artist

producer
Thing12

writer
Single

marital status
Person

Value2

1) Query Translation: constructing search space
representing all possible interpretations of query keywords
“written by freddie queen single”
Freddie
Mercury

Queen
Elizabeth 1

Artist

Freddie
Mercury

producer

Band

Queen

Data
Index

single

writer

member

Queen

Single

Single

Schema
Index

marital status

writer

Keyword Interpretation: use inverted
index and LM-based ranking function to
return relevant schema and data
elements
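
A minimal sketch of keyword interpretation with an inverted index. The slide names an LM-based ranking function; the ordering used here (shorter labels first) is only a crude stand-in, and `build_index` and `interpret` are hypothetical helper names.

```python
from collections import defaultdict

# Labels of schema and data elements from the running example.
ELEMENTS = [
    "Freddie Mercury", "Queen", "Queen Elizabeth 1", "single",   # data index
    "Single", "writer", "member", "marital status",              # schema index
]


def build_index(labels):
    """Map every lower-cased token to the element labels that mention it."""
    index = defaultdict(set)
    for label in labels:
        for token in label.lower().split():
            index[token].add(label)
    return index


def interpret(keyword, index):
    """Return matching elements; shorter labels first as a stand-in for an LM score."""
    hits = index.get(keyword.lower(), set())
    return sorted(hits, key=lambda label: (len(label.split()), label))


INDEX = build_index(ELEMENTS)
queen_matches = interpret("queen", INDEX)
```

The keyword "queen" stays ambiguous after this step (the band and the monarch both match); resolving that is left to the search space exploration.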

Person

Literal

Queen
Elizabeth 1

single

Search Space Construction: augment
pseudo schema with query-specific
keyword matching elements
• All possible connections of predicates
applicable to recognized query
keywords
Top-k Subgraph Exploration
Result Retrieval & Ranking

2) Query Translation: score-directed algorithm for finding
top-k subgraphs connecting keyword matching elements
“written by freddie queen single”

member
Artist

Freddie
Mercury


producer
Band

Queen

marital status

writer
Single

Person

Literal

Queen
Elizabeth 1

single

<x, type, Single>
<Queen, producer, x>
<Queen, type, Band>
<Freddy Mercury, type, Artist>

Algorithm: score-directed top-k Steiner graph search
• Start: explore all distinct paths starting from keyword elements
• Every iteration
– One-step expansion of the current path with the highest score
– When a connecting element is found, merge paths and add the resulting graph to the candidate list
• Top-k termination: lowest score in the candidate list > highest possible score that
can be achieved with paths in the queues yet to be explored
• Termination: all paths of maximum length d have been explored
• Final step: mapping rules to translate the Steiner graph to a structured query
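
A simplified sketch of the exploration loop, under several assumptions: every edge has unit cost, lower cost plays the role of higher score, each keyword has exactly one matching element, and only the connecting element and total cost are recorded instead of the merged paths. The function name `top_k_connections` and the toy graph are illustrative, not Graphinder's implementation.

```python
import heapq


def top_k_connections(graph, keyword_nodes, k=1, max_len=4):
    """Best-first exploration from all keyword elements at once.

    graph: dict node -> list of neighbour nodes (undirected, unit edge cost).
    Returns up to k (total_cost, connecting_node) pairs, cheapest first.
    """
    dist = [dict() for _ in keyword_nodes]            # cost of best path per keyword
    queue = [(0, i, n) for i, n in enumerate(keyword_nodes)]
    heapq.heapify(queue)
    candidates = []
    while queue:
        cost, i, node = heapq.heappop(queue)          # cheapest unexplored path
        if node in dist[i]:
            continue                                  # already reached more cheaply
        dist[i][node] = cost
        if all(node in d for d in dist):              # connecting element found
            candidates.append((sum(d[node] for d in dist), node))
            candidates.sort()
        if cost < max_len:                            # one-step expansion
            for nxt in graph.get(node, ()):
                if nxt not in dist[i]:
                    heapq.heappush(queue, (cost + 1, i, nxt))
        # Top-k termination: no pending path can beat the k-th candidate.
        if len(candidates) >= k and queue and candidates[k - 1][0] <= queue[0][0]:
            break
    return candidates[:k]


# Search space for "written by freddie queen single" (schema plus keyword elements).
GRAPH = {
    "Freddie Mercury": ["Artist"],
    "Artist": ["Freddie Mercury", "Single", "Band"],
    "Single": ["Artist", "Band"],
    "Band": ["Artist", "Single", "Queen"],
    "Queen": ["Band"],
}
best = top_k_connections(GRAPH, ["Freddie Mercury", "Queen", "Single"], k=1)
```

The bound used for early termination is loose but safe: any candidate completed later costs at least as much as the cheapest path still in the queue.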

Ranking Using Structured LMs: Keyword query is short and
ambiguous, while structured data provide rich structure
information: ranking based on LMs capturing both content and
structure

• Structured LMs for
structured results r
• Structured LM for queries
using structured pseudorelevant feedback results FR
(relevance model)
• Compute distance between
query and result LMs

$RM_r(v) = P(v \mid r)$

$RM_{F_R}(v) = P(v \mid F_R)$

$Score(r) = \sum_{v \in V} RM_{F_R}(v) \log RM_r(v)$
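
A minimal sketch of the scoring step, assuming unigram language models with additive smoothing over a fixed vocabulary. The smoothing scheme, the parameter `mu` and the function names are assumptions for illustration; the slides do not specify how the models are estimated.

```python
import math
from collections import Counter


def language_model(texts, vocab, mu=0.1):
    """Unigram model P(v | texts) with additive smoothing over `vocab`."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts[v] for v in vocab) + mu * len(vocab)
    return {v: (counts[v] + mu) / total for v in vocab}


def score(rm_feedback, rm_result):
    """Sum over v of RM_FR(v) * log RM_r(v); higher means closer to the query model."""
    return sum(p * math.log(rm_result[v]) for v, p in rm_feedback.items())


feedback = ["freddie mercury queen", "brian may queen"]   # pseudo-relevant results
candidates = {"r1": ["freddie mercury"], "r2": ["protest raid"]}

vocab = sorted({tok for text in feedback for tok in text.split()}
               | {tok for texts in candidates.values() for t in texts for tok in t.split()})
rm_fr = language_model(feedback, vocab)
scores = {name: score(rm_fr, language_model(texts, vocab))
          for name, texts in candidates.items()}
```

With this toy data the candidate sharing terms with the feedback results (`r1`) receives the higher score.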

Relevance Models
freddie queen
Query
F Documents

Mercury, Brian, May, Protest, Raid, Clash, Bank, West

• Term probabilities of the query model are
based on documents
• Ranking behaves like similarity search
between pseudo-relevant feedback
documents and corpus documents

Candidate Documents

Mercury, Brian, May, Protest, Raid, Clash, Bank, West

Structured Relevance Models
Structured Data

queen single
Query

F Results

Mercury, Brian, May, Protest, Raid, Clash, Bank, West

• Term probabilities of the query model are
based on pseudo-relevant structured data
• Ranking behaves like similarity search
between pseudo-relevant structured
results and structured result
candidates
Structured Data

Candidate Results

Mercury, Brian, May, Protest, Raid, Clash, Bank, West

Ranking: construct edge-specific query model for each unique e
from feedback resources FR, edge-specific model for every
candidate r, and finally, compute distance
– For all resources r in FR
– Prob of observing term v in value of property e of resource r
– Importance of resource r w.r.t. query

v          RMname   RMcomment   RMx
Mercury    .091     .01         …
Brian      .082     .01         …
Champion   .081     .02         …
Protest    .001     .042        …
Raid       .006     .014        …
…          …        …           …

v          RMname   RMcomment   RMx
Mercury    .073     .01         …
Brian      .052     .01         …
…          …        …           …
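
The three annotations above (for all resources r in FR; probability of observing term v in the value of property e of r; importance of r w.r.t. the query) suggest an edge-specific query model of roughly the following shape. This is a reconstruction under assumptions: the symbol $w(r, Q)$ for the importance weight and the notation $r_e$ for the value of property e of r are introduced here and do not appear on the slides.

```latex
RM_e(v) \;=\; \sum_{r \in F_R} P(v \mid r_e)\, w(r, Q)
```

Each column of the tables (RMname, RMcomment, …) would then correspond to one such model, one per edge label e.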

Query Rewriting: find syntactically and semantically valid
rewrites to suggest as user types
single from freddy mercury que
Freddie
Mercury

Queen
Elizabeth 1

Queen

single

writer

Single

Data
Index
Schema
Index

Benefits:
- Higher selectivity of query terms (quality)
- Reduced number of query terms (efficiency)
- Better search experience…
Freddie
Mercury

Data
Index

Queen

writer

Single

Schema
Index

Challenges: many rewrite candidates, some are
semantically not “valid” in the relational setting
single (marital status) writer “freddie mercury” queen
(the queen of UK)

Token rewriting via syntactic distance
Keyword Interpretation:
- Imprecise / fuzzy matching
1) single from freddie mercury queen
- Match every keyword
…
Token rewriting via semantic distance
1) single writer freddie mercury queen
…

Query segmentation
1) single writer “freddie mercury” queen
…

Keyword / Key Phrase Interpretation:
- Precise matching
- Match keyword and key phrases
Search Space Construction
Search Space Construction
Result Retrieval & Ranking

Probabilistic Model for Query Rewriting: the rank of a
query rewrite (suggestion) S is based on the
probability of observing S in the data, given the query
Based on Bayes' Theorem

Probability that users write spelling errors / semantically related
queries is independent of data D

single writer freddy mercury que

1) single writer freddie mercury queen
2) single writer freddrick mercury monarch
3) song writer freddrick mercury head of state

Constant
given query Q
and data D

Single

Artist

Person

Queen

Queen Elizabeth 1

Token Rewriting: S is
ranked high when prob
that query Q can be
observed in S is high

Query Segmentation: S is
ranked high when prob that
S can be observed in the
data D is high

Freddie Mercury

Brian
May

Liar

writer

1971

single

Token Rewriting
• Modeling token rewriting P(Q|S)

Split: |
Concatenate: +

• Independence assumption

• Modeling syntactic and semantic differences

P(q|t): is high when q is
syntactically and
semantically close to t
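
A minimal sketch of P(Q|S) under the independence assumption, covering only the syntactic part. It assumes P(q|t) decays exponentially with the Levenshtein distance between q and t; the decay form, the parameter `lam` and the one-to-one alignment of query tokens to suggestion tokens are all assumptions, and semantic distance is not modelled.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def p_token(q, t, lam=1.0):
    """Unnormalised P(q|t): high when q is syntactically close to t."""
    return math.exp(-lam * edit_distance(q, t))


def p_rewrite(query_tokens, suggestion_tokens):
    """P(Q|S) as a product of per-token terms (independence assumption)."""
    return math.prod(p_token(q, t) for q, t in zip(query_tokens, suggestion_tokens))


query = ["freddy", "que"]
close = p_rewrite(query, ["freddie", "queen"])
far = p_rewrite(query, ["freddrick", "monarch"])
```

Under this purely syntactic model the suggestion "freddie queen" outranks "freddrick monarch" for the input "freddy que".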

single writer freddy mercury que
1) single writer “freddie mercury” queen
2) single writer “freddrick mercury” monarch
3) single writer “freddrick mercury” head of state

single | writer | freddie + mercury | queen

Query Segmentation
• Modeling query segmentation P(S|D)
single writer freddie mercury que

α = concatenate?
α = split?
where $P_D(\alpha_i t_{i+1} \mid t_1 \alpha_1 t_2 \dots \alpha_{i-1} t_i)$
stands for $P(\alpha_i t_{i+1} \mid t_1 \alpha_1 t_2 \dots \alpha_{i-1} t_i, D)$.

Single

Artist

single writer freddie

Queen Elizabeth 1

Freddie
Mercury

Brian
May

Liar

writer

• Nth order Markov assumption

Person

Queen

1971

single

Estimating Probability of Segmentation
• Maximum likelihood estimation (MLE)

where C(ti…tj) denotes the count of occurrences of the token sequence ti…tj
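
A minimal sketch of maximum likelihood estimation from token-sequence counts, assuming the usual n-gram form P(next | context) = C(context next) / C(context). The slide's own formula did not survive extraction, so this standard form and the helper names are assumptions.

```python
from collections import Counter


def ngram_counts(sequences, max_n):
    """C(t_i … t_j): occurrences of every token sequence up to length max_n."""
    counts = Counter()
    for seq in sequences:
        for n in range(1, max_n + 1):
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    return counts


def mle(next_token, context, counts):
    """P(next | context) = C(context + next) / C(context); 0.0 if context is unseen."""
    context = tuple(context)
    denom = counts[context]
    return counts[context + (next_token,)] / denom if denom else 0.0


corpus = [["freddie", "mercury"], ["freddie", "mercury"], ["freddie", "j", "mercury"]]
counts = ngram_counts(corpus, max_n=2)
p = mle("mercury", ["freddie"], counts)
```

Here "freddie" occurs three times and "freddie mercury" twice, so `p` is 2/3.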

Segmentation in structured data setting
• Concatenate two segments si and sj when they co-occur in the data
• Split when si and sj are connected (si ↭ sj), i.e., when the two data
elements ni and nj mentioning si and sj are connected in the data
single writer freddie mercury queen

Single

Artist

α = concatenate?
α = split?

single writer freddie

Person

Queen

Freddie
Mercury

Brian
May
writer

Queen
Elizabeth 1

Liar

1971

single

Estimating Probability of Segmentation Case 1: previous
segment si has length equal to or greater than context N
• Two cases: (1) l(si) ≥ N; (2) l(si) < N
• (1) When the previously induced segment si has length equal to or
greater than N, i.e. l(si) ≥ N, it suffices to focus on si(N) to predict
the next action αi on ti+1
freddie j. mercury

queen

freddie j. mercury

queen

• Estimation of probability

where C(st) denotes the count of co-occurrences of the sequence st in D and
C(s ↭ t) is the count of all occurrences of token t connected to segment s

Estimating Probability of Segmentation Case 2: previous
segment si has length less than context N
• (2) When the previous segment si has length less than N, i.e. l(si) <
N, the action αi on the next token ti+1 depends on si and Pi(N), the
set of segments preceding si that, together with si, contain at
most N tokens in total, i.e.,
single

writer

freddie

mercury single

writer

freddie

mercury

• Estimation of probability

where C(P ↭ s) denotes the count of all occurrences of the segment s
connected to all segments in P

EXPERIMENTAL RESULTS &
CONCLUSIONS

• Graphinder, a relational keyword search approach for suggesting query
completions, translating queries and ranking results
• Keyword translation performance
– Query translation and index-based approaches are at least one order of magnitude
faster than online in-memory search (bidirectional)
– Query translation is comparable with index-based approaches, but needs less space
• Keyword translation result quality
– According to a recent benchmark, our ranking consistently outperforms all
existing ranking systems in precision, recall and MAP (10% - 30% improvement)
• Effect of query rewriting
– Better user experience
– Improves efficiency by reducing number of query terms
– Improves quality / selectivity of query terms
– …depends on complexity of queries and underlying keyword search engine
• Tight integration of query suggestion and translation
• From research prototypes to Graphinder, a powerful, flexible, low upfront-cost
semantic search system

Thanks!

Tran Duc Thanh
tran.du.th@gmail.com
http://sites.google.com/site/kimducthanh/

References (1)
– [VLDB14] Yongtao Ma, Thanh Tran
Probabilistic Query Rewriting for Efficient and Effective Keyword Search on
Graph Data
In International Conference on Very Large Data Bases (VLDB'14). Hangzhou,
China, September, 2014
– [ISWC13] Daniel Herzig, Roi Blanco, Peter Mika and Thanh Tran
Federated Entity Search Using On-the-Fly Consolidation
In International Semantic Web Conference (ISWC'13). Sydney, Australia, October,
2013
– [ICDE13] Yongtao Ma, Thanh Tran
TYPifier: Inferring the Type Semantics of Structured Data
In International Conference on Data Engineering (ICDE'13). Brisbane, Australia, April,
2013
– [WSDM13] Yongtao Ma, Thanh Tran
TYPiMatch: Type-specific Unsupervised Learning of Keys and Key Values for
Heterogeneous Web Data Integration
In International Conference on Web Search and Data Mining (WSDM'13). Rome,
Italy, February, 2013
– [TKDE12a] Thanh Tran, Günter Ladwig, Sebastian Rudolph
Managing Structured and Semi-structured RDF Data Using Structure Indexes
In Transactions on Knowledge and Data Engineering journal.
– [TKDE12b] Thanh Tran, Lei Zhang
Keyword Query Routing
In Transactions on Knowledge and Data Engineering journal.

References (2)
– [WWW12] Daniel Herzig, Thanh Tran
Heterogeneous Web Data Search Using Relevance-based On The Fly Data Integration
In Proceedings of 21st International World Wide Web Conference (WWW'12). Lyon,
France, April, 2012
– [CIKM11a] Günter Ladwig, Thanh Tran
Index Structures and Top-k Join Algorithms for Native Keyword Search Databases
In Proceedings of 20th ACM Conference on Information and Knowledge
Management (CIKM'11). Glasgow, UK, October, 2011
– [CIKM11b] Veli Bicer, Thanh Tran
Ranking Support for Keyword Search on Structured Data using Relevance Models
In Proceedings of 20th ACM Conference on Information and Knowledge
Management (CIKM'11). Glasgow, UK, October, 2011
– [SIGIR11] Roi Blanco, Harry Halpin, Daniel M. Herzig, Peter Mika, Jeffrey
Pound, Henry S. Thompson, Thanh Tran Duc
Repeatable and Reliable Search System Evaluation using Crowdsourcing
In Proceedings of 34th Annual International ACM SIGIR Conference (SIGIR'11),
Beijing, China, July, 2011
– [ICDE09] Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano
Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF
In Proceedings of the 25th International Conference on Data Engineering (ICDE'09).
Shanghai, China, March 2009
– [SIGMOD09] Haofen Wang, Thomas Penin, Kaifeng Xu, Junquan Chen, Xinruo Sun,
Linyun Fu, Yong Yu, Thanh Tran, Peter Haase, Rudi Studer
Hermes: A Travel through Semantics in the Data Web
In Proceedings of SIGMOD Conference 2009. Providence, USA, June-July, 2009
