Semantic Search
1. Related Entity Finding
on the Web
Peter Mika
Senior Research Scientist
Yahoo! Research
Joint work with B. Barla Cambazoglu and Roi Blanco
2. - 2 -
Yahoo! serves over 700 million users in 25 countries
3. - 3 -
Yahoo! Research: visit us at research.yahoo.com
4. - 4 -
Yahoo! Research Barcelona
• Established January, 2006
• Led by Ricardo Baeza-Yates
• Research areas
– Web Mining
• content, structure, usage
– Social Media
– Distributed Systems
– Semantic Search
5. - 5 -
Search is really fast, without necessarily being intelligent
6. - 6 -
Why Semantic Search? Part I
• Improvements in IR are harder and harder to come by
– Machine learning using hundreds of features
• Text-based features for matching
• Graph-based features provide authority
– Heavy investment in computational power, e.g. real-time
indexing and instant search
• Remaining challenges are not computational, but in
modeling user cognition
– Need a deeper understanding of the query, the content and/or
the world at large
– Could Watson explain why the answer is Toronto?
7. - 7 -
Poorly solved information needs
• Ambiguous searches
– Multiple interpretations
• jaguar
• paris hilton
– Secondary meaning
• george bush (and I mean the beer brewer in Arizona)
– Subjectivity
• reliable digital camera
• paris hilton sexy
– Imprecise or overly precise searches
• jim hendler
8. - 8 -
Poorly solved information needs
• Complex queries
– Missing information
• brad pitt zombie
• florida man with 115 guns
• peter mika barcelona
• 34 year old computer scientist living in barcelona
– Category queries
• countries in africa
• barcelona nightlife
– Complex needs
• 120 dollars in euros
• digital camera under 300 dollars
• world temperature in 2020
Many of these queries
would not be asked by
users, who learned over
time what search
technology can and can
not do.
10. - 10 -
Why Semantic Search? Part II
• The Semantic Web is now a reality
– Standardization and development
• Standards from the W3C
• Tools and services from a variety of vendors
– Data
• Large amounts of public data published in RDF
• Investment in private graphs
• Commercial interest
– Bing, Google, Yahoo, Yandex, Facebook
– Need to satisfy end users
• Not trained in logic
• Can’t learn complex query languages such as SPARQL
• May not know the data
• May not be experts in the domain
11. - 11 -
This is not a business model
http://en.wikipedia.org/wiki/Underpants_Gnomes
19. - 19 -
Contextual (pervasive, ambient) search
Yahoo! Connected TV: widget engine embedded into the TV
Yahoo! IntoNow: recognize audio and show related content
20. - 20 -
Interactive Voice Search
• Apple’s Siri
– Question-Answering
• Variety of backend sources
including Wolfram Alpha and
various Yahoo! services
– Task completion
• E.g. schedule an event
• Google Now
22. - 22 -
Parlance project
• Interactive, conversational voice search
– Complex dialogs around a set of objects
• Restaurant
– Area
– Price range
– Type of cuisine
– Complete system (mixed licence)
• Automated Speech Recognition (ASR)
• Spoken Language Understanding (SLU)
• Interaction Management
• Knowledge Base
• Natural Language Generation (NLG)
• Text-to-Speech (TTS)
– Video
• Commercial alternatives from Nuance
26. - 26 -
Semantic Search: a definition
• Semantic search is a retrieval paradigm where
– User intent and resources are represented using semantic models
• Not just symbolic representations
– Semantic models are exploited in the matching and ranking of resources
• Often a hybrid of document and data retrieval
– Documents with metadata
• Metadata may be embedded inside the document
• I’m looking for documents that mention countries in Africa.
– Data retrieval
• Structured data, but searchable text fields
• I’m looking for directors who have directed movies where the synopsis mentions dinosaurs.
• Wide range of semantic search systems
– Employ different semantic models, possibly at different steps of the search
process and in order to support different tasks
27. - 27 -
Note about explicit vs. implicit semantics
• Semantic Search methods focus on explicit semantics
– Elements of text or data are mapped to a substantially different
representation according to some theory of language or some
theory of the world
• As opposed to “bottom-up” or implicit semantics
– e.g. Latent Semantic Analysis (LSA)
28. - 28 -
Semantic models
• Semantics is concerned with the meaning of the resources made
available for search
– Various representations of meaning
• Linguistic models: models of relationships among words
– Taxonomies, thesauri, dictionaries of entity names
– Natural language structures extracted from text, e.g. using dependency
parsing
– Inference along linguistic relations, e.g. broader/narrower terms, textual
entailment
• Conceptual models: models of relationships among objects
– Ontologies capture entities in the world and their relationships
– Words and phrases in text or records in a database are identified as
representations of ontological elements
– Inference along ontological relations, e.g. logical entailment
29. - 29 -
Semantic Search – a process view
[Diagram] Documents and other resources are mapped into document and knowledge representations, linked through semantic models.
30. - 30 -
Semantic Search as a research field
• Intersection of Semantic Web, Information Retrieval, Natural Language
Processing and Databases
• Emerging field of research
– Workshops
• Exploiting Semantic Annotations in Information Retrieval (2008-2012) at CIKM
• Semantic Search (SemSearch) workshop series (2008-2011) at ESWC and WWW
• Entity-oriented search workshop (2010-2011) at SIGIR
• Joint Intl. Workshop on Semantic and Entity-oriented Search (2012) at SIGIR
• Semantic Search Workshop (2011-2013) at VLDB
– Special Issues of journals
– Surveys
• Related fields:
– XML retrieval
– Keyword search in databases
– Natural Language Retrieval
31. - 31 -
Comparison to IR and DB
• Information Retrieval (IR) supports the retrieval of documents
(document retrieval)
– Representation based on lightweight syntax-centric models
– Work well for topical search
– Not so well for more complex information needs
– Web scale
• Database (DB) and Knowledge-based Systems (KB) deliver more
precise answers (data retrieval)
– More expressive models
– Allow for complex queries
– Retrieve concrete answers that precisely match queries
– Not just matching and filtering, but also joins and more complex
inference
– Limitations in scalability
32. - 32 -
Search is required in the presence of ambiguity
Query (in increasing expressivity): keywords → NL questions → form-/facet-based inputs → structured queries (SPARQL)
Data (in increasing formality): RDF data embedded in text (RDFa) → semi-structured RDF data → structured RDF data → OWL ontologies with rich, formal semantics
Ambiguities: interpretation, extraction errors, data quality, confidence/trust
33. - 33 -
Common tasks in Semantic Search
• entity search (cf. SemSearch 2010/11)
• list search and list completion (cf. SemSearch 2011, TREC ELC task)
• related entity finding (cf. TREC REF-LOD task)
• semantic search
35. - 35 -
Motivation
• Some users are short on time
– Need for direct answers
– Query expansion, question-answering, information boxes, rich
results…
• Other users have time at their hand
– Long term interests such as sports, celebrities, movies and
music
– Long running tasks such as travel planning
37. - 37 -
Spark: related entity recommendations in web search
• A search assistance tool for exploration
• Recommend related entities given the user’s current query
– Cf. Entity Search at SemSearch, TREC Entity Track
• Ranking explicit relations in a Knowledge Base
– Cf. TREC Related Entity Finding in LOD (REF-LOD) task
• A previous version of the system live since 2010
• van Zwol et al.: Faceted exploration of image search results. WWW
2010: 961-970
41. - 41 -
High-Level Architecture View
[Diagram] Feature sources and the entity graph feed data preprocessing and feature extraction; features and editorial judgements drive model learning, which produces a ranking model; ranking and disambiguation generate the datapack (entity data and features).
42. - 42 -
Spark Architecture
45. - 45 -
Entity graph challenges
• Coverage of the query volume
– New entities and entity types
– Additional inference
– International data
– Aliases, e.g. jlo, big apple, thomas cruise mapother iv
• Freshness
– People query for a movie long before it’s released
• Irrelevant entity and relation types
– E.g. voice actors who co-acted in a movie, cities in a continent
• Data quality
– United States Senate career of Barack Obama is not a person
– Andy Lau has never acted in Iron Man 3
47. - 47 -
Feature extraction from text
• Text sources
– Query terms
– Query sessions
– Flickr tags
– Tweets
• Common representation
Input tweet:
Brad Pitt married to Angelina Jolie in Las Vegas
Output event:
Brad Pitt + Angelina Jolie
Brad Pitt + Las Vegas
Angelina Jolie + Las Vegas
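As a sketch of this common representation, the pairing step can be written as follows (the entity tagging itself is assumed to happen upstream; `extract_pairs` is an illustrative name, not the production code):

```python
from itertools import combinations

def extract_pairs(entities):
    """Turn the entity mentions found in one text unit (tweet, query,
    session, Flickr tag set) into unordered co-occurrence events."""
    # Canonical (sorted) pair order so counts aggregate across sources
    return [tuple(sorted(p)) for p in combinations(set(entities), 2)]

# Entities tagged in the example tweet from the slide
events = extract_pairs(["Brad Pitt", "Angelina Jolie", "Las Vegas"])
```

Aggregating these events over a query log or tweet stream yields the pair counts from which the co-occurrence features are computed.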
48. - 48 -
Features
• Unary
– Popularity features from text: probability, entropy, wiki id
popularity …
– Graph features: PageRank on the entity graph, wikipedia, web
graph
– Type features: entity type
• Binary
– Co-occurrence features from text: conditional probability, joint
probability …
– Graph features: common neighbors …
– Type features: relation type
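A minimal sketch of the binary text features, computed from aggregated co-occurrence counts (the counts, the total, and the function name are toy assumptions):

```python
def cooccurrence_features(pair_counts, entity_counts, total_events, e1, e2):
    """Joint probability P(e1, e2) and conditional probability P(e2 | e1)
    for a candidate related entity e2 given the query entity e1."""
    pair = tuple(sorted((e1, e2)))
    joint = pair_counts.get(pair, 0) / total_events
    conditional = pair_counts.get(pair, 0) / entity_counts[e1]
    return {"joint": joint, "conditional": conditional}

# Toy counts aggregated from 1000 co-occurrence events
pair_counts = {("Angelina Jolie", "Brad Pitt"): 50}
entity_counts = {"Brad Pitt": 200, "Angelina Jolie": 120}
feats = cooccurrence_features(pair_counts, entity_counts, 1000,
                              "Brad Pitt", "Angelina Jolie")
# feats == {"joint": 0.05, "conditional": 0.25}
```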
49. - 49 -
Feature extraction challenges
• Efficiency of text tagging
– Hadoop Map/Reduce
• More features are not always better
– Can lead to over-fitting without sufficient training data
51. - 51 -
Model Learning
• Training data created by editors (five grades)
400   Brandi    adriana lima        Brad Pitt  person  Embarrassing
1397  David H.  andy garcia         Brad Pitt  person  Mostly Related
3037  Jennifer  benicio del toro    Brad Pitt  person  Somewhat Related
4615  Sarah     burn after reading  Brad Pitt  person  Excellent
9853  Jennifer  fight club movie    Brad Pitt  person  Perfect
• Join between the editorial data and the feature file
• Trained a regression model using GBDT
–Gradient Boosted Decision Trees
• 10-fold cross validation optimizing NDCG and tuning
– number of trees
– number of nodes per tree
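As a hedged illustration of this step, here is a minimal pointwise GBDT built from depth-1 regression stumps with least-squares loss. It stands in for the internal implementation; the feature values, the grade-to-target mapping, and the hyperparameters are all toy assumptions.

```python
# Map the five editorial grades to numeric regression targets (assumed)
GRADES = {"Embarrassing": 0, "Somewhat Related": 1, "Mostly Related": 2,
          "Excellent": 3, "Perfect": 4}

def fit_stump(X, residuals):
    """Best single-feature threshold split minimizing squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, (j, t, lm, rm))
    return best[1]

def fit_gbdt(X, y, n_trees=20, lr=0.3):
    """Boost stumps on the residuals, as in gradient tree boosting."""
    base = sum(y) / len(y)
    preds, trees = [base] * len(y), []
    for _ in range(n_trees):
        resid = [yi - p for yi, p in zip(y, preds)]
        j, t, lm, rm = fit_stump(X, resid)
        trees.append((j, t, lm, rm))
        preds = [p + lr * (lm if row[j] <= t else rm)
                 for row, p in zip(X, preds)]
    return base, lr, trees

def predict(model, row):
    base, lr, trees = model
    return base + sum(lr * (lm if row[j] <= t else rm)
                      for j, t, lm, rm in trees)

# Toy features per (query entity, candidate) pair:
# [conditional probability, common neighbors]
X = [[0.01, 1], [0.05, 2], [0.10, 3], [0.30, 8], [0.60, 15]]
y = [GRADES[g] for g in ["Embarrassing", "Somewhat Related",
                         "Mostly Related", "Excellent", "Perfect"]]
model = fit_gbdt(X, y)
scores = [predict(model, row) for row in X]
```

In production, the number of trees and the number of nodes per tree are the parameters tuned via cross validation.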
53. - 53 -
Impact of training data
Number of training instances (judged relations)
54. - 54 -
Performance by query-entity type
• High overall performance, but some types are more difficult
• Locations
– Editors downgrade popular entities such as businesses
NDCG by type of the query entity
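NDCG, the metric reported per type here, can be sketched with graded gains and a log-position discount (a standard formulation, not necessarily the exact variant used internally):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with 2^rel - 1 gains."""
    return sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(relevances))

def ndcg(relevances):
    """DCG normalized by the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Toy ranked list of editorial grades (0-4)
score = ndcg([3, 2, 0, 1])
```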
55. - 55 -
Model Learning challenges
• Editorial preferences do not necessarily coincide with usage
– Users click a lot more on people than expected
– Image bias?
• Alternative: optimize for usage data
– Clicks turned into labels or preferences
– Size of the data is not a concern
– Gains are computed from normalized CTR/COEC
– See van Zwol et al. Ranking Entity Facets Based on User Click
Feedback. ICSC 2010: 192-199.
Excerpt from van Zwol et al., ICSC 2010:

Conditional probabilities. The facet ranking function is a linear combination of conditional probabilities extracted from query terms (qt), query sessions (qs) and Flickr tags (ft):

  rank(e, f) = wqt × Pqt(f|e) + wqs × Pqs(f|e) + wft × Pft(f, e)    (1)

On the editorial judgements collected for a couple of hundred entities and their facets, this linear combination performs best with wqt = 2, wqs = 0.5 and wft = 1. The editorial data was not substantial enough to learn a ranking with GBDT.

Click-through rate versus clicks over expected clicks. From the image search query logs, the user clicks related to the facets are collected. Let clicks(e,f) be the number of clicks on a facet entity f shown in relation to entity e, and views(e,f) the number of times facet f is shown to a user for a related entity e. The probability of a click on facet f for a given entity e is then modelled as:

  ctr(e,f) = clicks(e,f) / views(e,f)    (2)

The CTR declines as the position at which a facet is shown increases. A second click model, clicks over expected clicks (COEC), deals with this so-called position bias – facets appearing in lower positions are less likely to be clicked even if they are relevant. This phenomenon is often observed in Web search, and the COEC model proposed by Chapelle and Zhang is adopted: ctr_p is estimated as the aggregated CTR – over all queries and sessions – at position p, for all positions P. With views(e,f,p) the number of times facet f is shown for entity e at position p:

  coec(e,f) = clicks(e,f) / Σ_{p=1..P} views(e,f,p) × ctr_p    (3)

Zhang and Jones refer to this method as clicks over expected clicks, since the denominator is the number of clicks expected given the positions at which the result appeared.

Gradient Boosted Decision Trees. Stochastic gradient boosted decision trees (GBDT) is one of the most widely used learning algorithms today. Gradient tree boosting constructs an additive regression model using decision trees as weak learners; one advantage over other learners is that the feature contributions are highly interpretable. Different loss functions can be used – here least squares, though related work applies pairwise and ranking-specific losses to improve search relevance. Because the trees are shallow and each is trained on a randomly selected subset of the data, the model is less prone to over-fitting.
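Equations (2) and (3) translate directly into code; the numbers and the per-position CTR table below are toy assumptions:

```python
def ctr(clicks, views):
    """Equation (2): click-through rate of facet f for entity e."""
    return clicks / views

def coec(clicks, views_by_position, ctr_by_position):
    """Equation (3): clicks over expected clicks. The denominator is the
    number of clicks expected given the positions the facet appeared in,
    which normalizes out position bias."""
    expected = sum(views_by_position[p] * ctr_by_position[p]
                   for p in views_by_position)
    return clicks / expected

# Toy data: a facet shown 100 times in position 1 and 50 times in
# position 3, clicked 12 times in total
views = {1: 100, 3: 50}
global_ctr = {1: 0.10, 3: 0.02}  # assumed aggregated per-position CTRs
rate = ctr(12, 150)                   # 0.08
score = coec(12, views, global_ctr)   # 12 / 11, i.e. above expectation
```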
57. - 57 -
Ranking and Disambiguation
• We apply the ranking function offline to the data
• Disambiguation
– How many times a given wiki id was retrieved for queries containing the entity
name?
Brad Pitt Brad_Pitt 21158
Brad Pitt Brad_Pitt_(boxer) 247
XXX XXX_(movie) 1775
XXX XXX_(Asia_album) 89
XXX XXX_(ZZ_Top_album) 87
XXX XXX_(Danny_Brown_album) 67
– PageRank for disambiguating locations (wiki ids are not available)
• Expansion to query patterns
– Entity name + context, e.g. brad pitt actor
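The count-based disambiguation above amounts to picking, for a surface name, the wiki id most often retrieved for queries containing that name (counts are the ones shown on the slide; tie-breaking and the PageRank fallback for locations are omitted):

```python
retrieval_counts = {
    ("Brad Pitt", "Brad_Pitt"): 21158,
    ("Brad Pitt", "Brad_Pitt_(boxer)"): 247,
}

def disambiguate(name, counts):
    """Return the wiki id retrieved most often for this entity name."""
    candidates = {wid: c for (n, wid), c in counts.items() if n == name}
    return max(candidates, key=candidates.get) if candidates else None

winner = disambiguate("Brad Pitt", retrieval_counts)  # "Brad_Pitt"
```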
58. - 58 -
Ranking and Disambiguation challenges
• Disambiguation cases that are too close to call
– Fargo Fargo_(film) 3969
– Fargo Fargo,_North_Dakota 4578
• Disambiguation across Wikipedia and other sources
59. - 59 -
Evaluation #2: Side-by-side testing
• Comparing two systems
– A/B comparison, e.g. current system under development and production
system
– Scale: A is better, B is better
• Separate tests for relevance and image quality
– Image quality can significantly influence user perceptions
– Images can violate safe search rules
• Classification of errors
– Results: missing important results/contains irrelevant results, too few results,
entities are not fresh, more/less diverse, should not have triggered
– Images: bad photo choice, blurry, group shots, nude/racy etc.
• Notes
– Borderline: set one’s entities relate to the movie Psy, but the query is most likely about Gangnam Style
– Blondie and Mickey Gilley are 70’s performers and do not belong on a list of
60’s musicians.
– There is absolutely no relation between Finland and California.
60. - 60 -
Evaluation #3: Bucket testing
• Also called online evaluation
– Comparing against baseline version of the system
– Baseline does not change during the test
• Small % of search traffic redirected to test system, another
small % to the baseline system
• Data collection over at least a week, looking for statistically significant differences that are also stable over time
• Metrics in web search
– Coverage and Click-through Rate (CTR)
– Searches per browser-cookie (SPBC)
– Other key metrics should not be impacted negatively, e.g. abandonment and retry rate, Daily Active Users (DAU), Revenue Per Search (RPS), etc.
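The statistical check implied here can be sketched as a two-proportion z-test on bucket CTRs (toy click/view counts; in practice the metrics are also tracked per day so that the difference can be shown to be stable over time):

```python
import math

def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """z-statistic for the difference between two buckets' CTRs,
    using the pooled proportion for the standard error."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    return (p_a - p_b) / se

# Toy buckets: test CTR 5.5% vs. baseline 5.0% on 100k views each
z = two_proportion_z(5500, 100000, 5000, 100000)
# |z| > 1.96 would indicate significance at the 5% level
```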
61. - 61 -
Coverage before and after the new system
[Chart: coverage over ~80 days, with the point where Spark is deployed in production marked. Before release: flat, lower. After release: flat, higher.]
62. - 62 -
Click-through rate (CTR) before and after the new system
[Chart: CTR over ~80 days, with the point where Spark is deployed in production marked. Before release: gradually degrading performance due to lack of fresh data. After release: learning effect – users are starting to use the tool again.]
63. - 63 -
Summary
• Spark
– System for related entity recommendations
• Knowledge base
• Extraction of signals from query logs and other user-generated
content
• Machine learned ranking
• Evaluation
• Other applications
– Recommendations on topic-entity pages
64. - 64 -
Future work
• New query types
– Queries with multiple entities
• adele skyfall
– Question-answering on keyword queries
• brad pitt movies
• brad pitt movies 2010
• Extending coverage
– Spark now live in CA, UK, AU, NZ, TW, HK, ES
• Even fresher data
– Stream processing of query log data
• Data quality improvements
• Online ranking with post-retrieval features
65. - 65 -
The End
• Many thanks to
– Barla Cambazoglu and Roi Blanco (Barcelona)
– Nicolas Torzec (US)
– Libby Lin (Product Manager, US)
– Search engineering (Taiwan)
• Contact
– pmika@yahoo-inc.com
– @pmika
• Internships available for PhD students
Editor’s Notes
In fact, some of these searches are so hard that the users don’t even try them anymore
With ads, the situation is even worse due to the sparsity problem. Note how poor the ads are…
Challenges: “so someone could search for ‘volcanic eruptions in the 18th century’ or ‘Lady Gaga concerts in a warm outdoor location.’” For now, there’s no way for Google or Microsoft to reliably parse those sorts of queries against entities’ properties. And Microsoft’s Weitz says that “people have been beaten down in search for so long, they’d never ask” those sorts of questions in the first place. Complex relational queries are the sorts of things that Bing and Google tend to push off to their more vertical search areas, like Bing’s shopping and travel search and Google’s Shopping and Flights.
This is how a human sees the world.
This is how a machine sees the world… Machines are not ‘intelligent’ and cannot ‘read’… they just see a string of symbols and try to match the user’s input to that stream.
Semantic search can be seen as a retrieval paradigm centered on the use of semantics: a system that incorporates the semantics entailed by the query and/or the resources into the matching process essentially performs semantic search.
Conceptual models, e.g. ER models, capture relations between data elements – essentially relationships between syntactic elements and their interpretations.
- In recent years we have witnessed tremendous interest and substantial economic exploitation of search technologies, both at web and enterprise scale. In this regard, technologies for Information Retrieval (IR) can be distinguished from solutions in the field of Database (DB) and Knowledge-based Expert Systems (KB). Whereas IR applications support the retrieval of documents (document retrieval), DB and KB systems deliver more precise answers (data retrieval). The technical differences between these two paradigms can be broken down into three main dimensions, i.e. the representation of the user need ( query model ), the underlying resources ( data model ) and the matching technique . The representation of user need and resource content in current IR systems is still almost exclusively achieved by the lightweight syntax-centric models such as the predominant keyword paradigm (i.e. keyword queries matched against bag-of-words document representation). While these systems have shown to work well for topical search, i.e. retrieve documents based on a topic, they usually fail to address more complex information needs. Using more expressive models for the representation of the user need, and leveraging the structure and semantics inherent in the data, DB and KB systems allow for complex queries, and for the retrieval of concrete answers that precisely match them.