SlideShare uma empresa Scribd logo
1 de 65
Related Entity Finding
on the Web
Peter Mika
Senior Research Scientist
Yahoo! Research
Joint work with B. Barla Cambazoglu and Roi Blanco
- 2 -
Yahoo! serves over 700 million users in 25 countries
- 3 -
Yahoo! Research: visit us at research.yahoo.com
- 4 -
Yahoo! Research Barcelona
• Established January, 2006
• Led by Ricardo Baeza-Yates
• Research areas
– Web Mining
• content, structure, usage
– Social Media
– Distributed Systems
– Semantic Search
- 5 -
Search is really fast, without necessarily being intelligent
- 6 -
Why Semantic Search? Part I
• Improvements in IR are harder and harder to come by
– Machine learning using hundreds of features
• Text-based features for matching
• Graph-based features provide authority
– Heavy investment in computational power, e.g. real-time
indexing and instant search
• Remaining challenges are not computational, but in
modeling user cognition
– Need a deeper understanding of the query, the content and/or
the world at large
– Could Watson explain why the answer is Toronto?
- 7 -
Poorly solved information needs
• Ambiguous searches
– Multiple interpretations
• jaguar
• paris hilton
– Secondary meaning
• george bush (and I mean the beer brewer in Arizona)
– Subjectivity
• reliable digital camera
• paris hilton sexy
– Imprecise or overly precise searches
• jim hendler
- 8 -
Poorly solved information needs
• Complex queries
– Missing information
• brad pitt zombie
• florida man with 115 guns
• peter mika barcelona
• 34 year old computer scientist living in barcelona
– Category queries
• countries in africa
• barcelona nightlife
– Complex needs
• 120 dollars in euros
• digital camera under 300 dollars
• world temperature in 2020
Many of these queries
would not be asked by
users, who learned over
time what search
technology can and can
not do.
Many of these queries
would not be asked by
users, who learned over
time what search
technology can and can
not do.
- 9 -
Ambiguity
- 10 -
Why Semantic Search? II.
• The Semantic Web is now a reality
– Standardization and development
• Standards from the W3C
• Tools and services from a variety of vendors
– Data
• Large amounts of public data published in RDF
• Investment in private graphs
• Commercial interest
– Bing, Google, Yahoo, Yandex, Facebook
– Need to satisfy end users
• Not trained in logic
• Can’t learn complex query languages such as SPARQL
• May not know the data
• May not be experts in the domain
- 11 -
This is not a business model
http://en.wikipedia.org/wiki/Underpants_Gnomes
- 12 -
Entity-based Disambiguation
- 13 -
Entity Search
- 14 -
Entity-based Navigation / Exploration
- 15 -
Factual Search
- 16 -
Facebook Graph Search
- 17 -
Facebook Graph Search
- 18 -
Aggregation and computation
- 19 -
Contextual (pervasive, ambient) search
Yahoo! Connected
TV:
Widget engine
embedded into the
TV
Yahoo! Connected
TV:
Widget engine
embedded into the
TV
Yahoo! IntoNow: recognize
audio and show related content
Yahoo! IntoNow: recognize
audio and show related content
- 20 -
Interactive Voice Search
• Apple’s Siri
– Question-Answering
• Variety of backend sources
including Wolfram Alpha and
various Yahoo! services
– Task completion
• E.g. schedule an event
• Google Now
- 21 -
Conversational Search
• Google’s Interactive Voice Search
- 22 -
Parlance project
• Interactive, conversational voice search
– Complex dialogs around a set of objects
• Restaurant
– Area
– Price range
– Type of cuisine
– Complete system (mixed licence)
• Automated Speech Recognition (ASR)
• Spoken Language Understanding (SLU)
• Interaction Management
• Knowledge Base
• Natural Language Generation (NLG)
• Text-to-Speech (TTS)
– Video
• Commercial alternatives from Nuance
What is Semantic Search?
- 24 -
What it’s like to be a machine?
Roi Blanco
- 25 -
What it’s like to be a machine?
✜Θ♬♬ţğ
✜Θ♬♬ţğ√∞®ÇĤĪ✜★♬☐✓✓
ţğ★✜
✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫Γ
≠=⅚©§★✓♪ΒΓΕ℠
✖Γ ±♫⅜ ⏎↵⏏☐ģğğğμλκσςτ
⏎⌥°¶§ΥΦΦΦ✗✕☐
- 26 -
Semantic Search: a definition
• Semantic search is a retrieval paradigm where
– User intent and resources are represented using semantic models
• Not just symbolic representations
– Semantic models are exploited in the matching and ranking of resources
• Often a hybrid of document and data retrieval
– Documents with metadata
• Metadata may be embedded inside the document
• I’m looking for documents that mention countries in Africa.
– Data retrieval
• Structured data, but searchable text fields
• I’m looking for directors, who have directed movies where the synopsis
mentions dinosaurs.
• Wide range of semantic search systems
– Employ different semantic models, possibly at different steps of the search
process and in order to support different tasks
- 27 -
Note about explicit vs. implicit semantics
• Semantic Search methods focus on explicit semantics
– Elements of text or data are mapped to a substantially different
representation according to some theory of language or some
theory of the world
• As opposed to “bottom-up” or implicit semantics
– e.g. Latent Semantic Analysis (LSA)
- 28 -
Semantic models
• Semantics is concerned with the meaning of the resources made
available for search
– Various representations of meaning
• Linguistic models: models of relationships among words
– Taxonomies, thesauri, dictionaries of entity names
– Natural language structures extracted from text, e.g. using dependency
parsing
– Inference along linguistic relations, e.g. broader/narrower terms, textual
entailment
• Conceptual models: models of relationships among objects
– Ontologies capture entities in the world and their relationships
– Words and phrases in text or records in a database are identified as
representations of ontological elements
– Inference along ontological relations, e.g. logical entailment
- 29 -
Semantic Search – a process view
Document Representation
Knowledge Representation
Semantic Models
Resources
Documents
- 30 -
Semantic Search as a research field
• Intersection of Semantic Web, Information Retrieval, Natural Language
Processing and Databases
• Emerging field of research
– Workshops
• Exploiting Semantic Annotations in Information Retrieval (2008-2012) at CIKM
• Semantic Search (SemSearch) workshop series (2008-2011) at ESWC and WWW
• Entity-oriented search workshop (2010-2011) at SIGIR
• Joint Intl. Workshop on Semantic and Entity-oriented Search (2012) at SIGIR
• Semantic Search Workshop (2011-2013) at VLDB
– Special Issues of journals
– Surveys
• Related fields:
– XML retrieval
– Keyword search in databases
– Natural Language Retrieval
- 31 -
Comparison to IR and DB
• Information Retrieval (IR) support the retrieval of documents
(document retrieval)
– Representation based on lightweight syntax-centric models
– Work well for topical search
– Not so well for more complex information needs
– Web scale
• Database (DB) and Knowledge-based Systems (KB) deliver more
precise answers (data retrieval)
– More expressive models
– Allow for complex queries
– Retrieve concrete answers that precisely match queries
– Not just matching and filtering, but also joins and more complex
inference
– Limitations in scalability
- 32 -
Search is required in the presence of ambiguity
Query
Data
KeywordsKeywords
NL
Questions
NL
Questions
Form- / facet-
based Inputs
Form- / facet-
based Inputs
Structured Queries
(SPARQL)
Structured Queries
(SPARQL)
OWL ontologies with
rich, formal
semantics
OWL ontologies with
rich, formal
semantics
Structured
RDF data
Structured
RDF data
Semi-
Structured
RDF data
Semi-
Structured
RDF data
RDF data
embedded in
text (RDFa)
RDF data
embedded in
text (RDFa)
Ambiguities: interpretation
Ambiguities: interpretation, extraction errors, data quality, confidence/trust
- 33 -
list search
related entity finding
entity search
SemSearch 2010/11
list completion
SemSearch 2011
TREC ELC taskTREC REF-LOD task
semantic search
Common tasks in Semantic Search
Related entity ranking
in web search
- 35 -
Motivation
• Some users are short on time
– Need for direct answers
– Query expansion, question-answering, information boxes, rich
results…
• Other users have time at their hand
– Long term interests such as sports, celebrities, movies and
music
– Long running tasks such as travel planning
- 36 -
Example user sessions
- 37 -
Spark: related entity recommendations in web search
• A search assistance tool for exploration
• Recommend related entities given the user’s current query
– Cf. Entity Search at SemSearch, TREC Entity Track
• Ranking explicit relations in a Knowledge Base
– Cf. TREC Related Entity Finding in LOD (REF-LOD) task
• A previous version of the system live since 2010
• van Zwol et al.: Faceted exploration of image search results. WWW
2010: 961-970
- 38 -
Spark example I.
- 39 -
Spark example II.
- 40 -
How does it work?
/user/torzecn/
Shared/Entity-
Relationship_Graphs
Yalinda feed parser
(GFFeedMapper,
GFFeedReducer)
$VIS_GRID_HOME/
gfstorageall/{1,2,3}/
yalinda/phyfacet/data
/projects/gridfaces/
feed/yahooomg/
Y! OMG feed parser
(GFFeedMapper,
GFFeedReducer)
$VIS_GRID_HOME/
gfstorageall/6/yahooomg/
phyfacet/data
/projects/gridfaces/
feed/geo
Y! Geo feed parser
(GFFeedMapper,
GFFeedReducer)
$VIS_GRID_HOME/
gfstorageall/10/geo/
phyfacet/data
/projects/gridfaces/
feed/yahootv
Y! TV feed parser
(GFFeedMapper,
GFFeedReducer)
$VIS_GRID_HOME/
gfstorageall/5
/projects/gridfaces/
feed/editorialdata/
data/objects
Editorial feed parser
(GFFeedMapper,
GFEditorialMergeReducer)
$VIS_GRID_HOME/
gfstorageall/logicalobj/
dataArchive/
$VIS_GRID_HOME/
gfstorageall/
logicalfacet/data
$VIS_GRID_HOME/
gfstorageall/logicalobj/
data
/projects/gridfaces/
feed/editorialdata/
data/facets
GFDumpMain-First
(DumpFirstMapper,
DumpFirstReducer)
Empty/missing
Activated
Deactivated
$VIS_GRID_HOME/
rankinput
STEP 1: FEED PARSING PIPELINE
GFDumpMain-Second
(DumpSecondMapper,
DumpSecondReducer)
$VIS_GRID_HOME/
rankinputTmp
CreateDictionary
(DictionaryMapper,
DistinctReducer)
$VIS_GRID_HOME/
rankinput
$VIS_GRID_HOME/
rankprepout/
dictionary
$VIS_GRID_HOME/
rankprepout/spark/
logs/week{WNo}
/data/SDS/data/
search_US
CreateIntermediate
LogFormat
(NullValueSillyMapp
er, DistinctReducer)
CreateQuerySessions
CommonModelFilter
(FiltererMapper,
FiltererReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
qsessions lfi terer
CreateQuerySessions
CommonModel
(SillyMapper,
QuerySessionsCreate
SessionsReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
qsessions_cm
CreateQueryTermsJoin
DictionaryAndLogs
(LeftJoinMapper,
LeftJoinReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
qtermsjoindictionarylog
CreateQueryTermsJoin
QTermsAndLogs
(SillyMapper,
ImplodeReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
qterms_cm
CreateFlickrIntermediate
LogFormat
(FlickrReformatMapper)
/projects/gridfaces/
ifl ckr/feed
$VIS_GRID_HOME/
rankprepout/tmp/
ifl ckrtagsintermediate
CreateFlickrTagsCommon
ModelFilter
(FiltererMapper,
FiltererReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
ifl ckrtags lfi terer
CreateFlickrTagsCommon
ModelFilter
(SillyMapper,
ImplodeReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
ifl ckr_cm
General
Query logs
Flickr
Twitter
/projects/rtds/twitter/
refi hose
CreateTwitterIntermediate
LogFormat
(NullValueSillyMapper,
DistinctReducer)
$VIS_GRID_HOME/
rankprepout/spark/
twitter_logs/week{WNo}
CreateTweetsCommon
ModelFilter
(FiltererMapper,
FiltererReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
tweets lfi terer
CreateTweetsCommon
Model
(SillyMapper,
ImplodeReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
tweets_cm
DistinctUsersTwitter
(SillyMapper,
DistinctReducer)
$VIS_GRID_HOME/
rankprepout/tmp/tweets/
distinctusers
DistinctUsers ifl ckr
(SillyMapper,
DistinctReducer)
$VIS_GRID_HOME/
rankprepout/tmp/ ifl ckr/
distinctusers
DistinctUsersQSessions
(SillyMapper,
DistinctReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
qsessions/distinctusers
DistinctUsersQTerms
(SillyMapper,
DistinctReducer)
$VIS_GRID_HOME/
rankprepout/tmp/qterms/
distinctusers
CountUsers
(CounterMapper,
CounterReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
distinctusers
CountEvents
(CounterMapper,
CounterReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
countevents
STEP 2: PREPROCESSING PIPELINE (before feature extraction and ranking)
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
probability
EventProbabilityQTerms
(ProbabilityMapper,
ProbabilityReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
qterms_cm
$VIS_GRID_HOME/
rankprepout/tmp/
countevents
EventConditionalProbabilityQterms
(ConditionalProbabilityMapper,
ConditionalProbabilityReducer)
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
conditionalprobability
EventJointProbabilityQterms
(JointProbabilityMapper,
ProbabilityReducer)
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
jointprobability
EventJointUserProbabilityQTerms
(JointUserProbabilityMapper,
JointUserProbabilityReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
countusers
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
jointuserprobability
EventConditionalUserProb
abilityQterms
(ConditionalUserProbability
PrepareMapper)
$VIS_GRID_HOME/
rankprepout/tmp/
conditionaluserproba
bility1/qterms
ConditionalUserProbability
2_3-qt
(ConditionalUserProbability
PrepareMapper)
$VIS_GRID_HOME/
rankprepout/tmp/
conditionaluserprobability2/
qterms/query
$VIS_GRID_HOME/
rankprepout/tmp/
conditionaluserprobability2/
qterms/queryfacet
EventConditionalUser
Probability3_qterms
(SillyMapper,
ConditionalUserProba
bilityReducer)
$VIS_GRID_HOME/
rankprepout/spark/qterms/
week{WNo}/
conditionaluserprobability
EventEntropyQTerms
(EntropyMapper,
EntropyReducer)
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
entropy
EventPMI1QTerms
(SillyMapper,
JoinUnaryMetricReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
pmiqterms
EventPMIQTerms
(PMIMapper,
PMIReducer)
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
pmi
EventKLDivergenceUnary1QTerms
(KLDivergenceUnaryJoinerMapper,
JoinUnaryMetricReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
klunaryqterms
EventKLDivergenceUnaryQTerms
(SillyMapper,
KLDivergenceUnaryReducer)
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
kldivergenceunary
EventCosineSimilarityQTerms
(PMIMapper,
CosineSimilarityReducer)
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
cosinesimilarity
Extracted features
Others
STEP 3a: FEATURE EXTRACTION (QUERY TERMS) PIPELINE
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/
probability
EventProbabilityFlickr
(ProbabilityMapper,
ProbabilityReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
ifl ckr_cm
$VIS_GRID_HOME/
rankprepout/tmp/
countevents
EventConditionalProbabilityFlickr
(ConditionalProbabilityMapper,
ConditionalProbabilityReducer)
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/
conditionalprobability
EventJointProbabilityFlickr
(JointProbabilityMapper,
ProbabilityReducer)
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/
jointprobability
EventJointUserProbabilityFlickr
(JointUserProbabilityMapper,
JointUserProbabilityReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
countusers
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/
jointuserprobability
EventConditionalUserProb
abilityFlickr
(ConditionalUserProbability
PrepareMapper)
$VIS_GRID_HOME/
rankprepout/tmp/
conditionaluserproba
bility1/ ifl ckr
ConditionalUserProbability
2_3-fl
(ConditionalUserProbability
PrepareMapper)
$VIS_GRID_HOME/
rankprepout/tmp/
conditionaluserprobability2/
ifl ckr/query
$VIS_GRID_HOME/
rankprepout/tmp/
conditionaluserprobability2/
ifl ckr/queryfacet
EventConditionalUser
Probability3_ ifl ckr
(SillyMapper,
ConditionalUserProba
bilityReducer)
$VIS_GRID_HOME/
rankprepout/spark/ ifl ckr/
week{WNo}/
conditionaluserprobability
EventEntropyFlickr
(EntropyMapper,
EntropyReducer)
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/
entropy
EventPMI1Flickr
(SillyMapper,
JoinUnaryMetricReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
pmi ifl ckr
EventPMIFlickr
(PMIMapper,
PMIReducer)
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/
pmi
EventKLDivergenceUnary1Flickr
(KLDivergenceUnaryJoinerMapper,
JoinUnaryMetricReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
klunary ifl ckr
EventKLDivergenceUnaryFlickr
(SillyMapper,
KLDivergenceUnaryReducer)
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/
kldivergenceunary
EventCosineSimilarityFlickr
(PMIMapper,
CosineSimilarityReducer)
$VIS_GRID_HOME/
rankprepout/spark/ ifl ckr/
week{WNo}/
cosinesimilarity
Extracted features
Others
STEP 3c: FEATURE EXTRACTION (FLICKR TAGS) PIPELINE
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
probability
EventProbabilityTweets
(ProbabilityMapper,
ProbabilityReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
tweets_cm
$VIS_GRID_HOME/
rankprepout/tmp/
countevents
EventConditionalProbabilityTweets
(ConditionalProbabilityMapper,
ConditionalProbabilityReducer)
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
conditionalprobability
EventJointProbabilityTweets
(JointProbabilityMapper,
ProbabilityReducer)
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
jointprobability
EventJointUserProbabilityTweets
(JointUserProbabilityMapper,
JointUserProbabilityReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
countusers
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
jointuserprobability
EventConditionalUserProb
abilityTweets
(ConditionalUserProbability
PrepareMapper)
$VIS_GRID_HOME/
rankprepout/tmp/
conditionaluserproba
bility1/tweets
ConditionalUserProbability
2_3-tw
(ConditionalUserProbability
PrepareMapper)
$VIS_GRID_HOME/
rankprepout/tmp/
conditionaluserprobability2/
tweets/query
$VIS_GRID_HOME/
rankprepout/tmp/
conditionaluserprobability2/
tweets/queryfacet
EventConditionalUser
Probability3_tweets
(SillyMapper,
ConditionalUserProba
bilityReducer)
$VIS_GRID_HOME/
rankprepout/spark/tweets/
week{WNo}/
conditionaluserprobability
EventEntropyTweets
(EntropyMapper,
EntropyReducer)
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
entropy
EventPMI1Tweets
(SillyMapper,
JoinUnaryMetricReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
pmitweets
EventPMITweets
(PMIMapper,
PMIReducer)
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
pmi
EventKLDivergenceUnary1Tweets
(KLDivergenceUnaryJoinerMapper,
JoinUnaryMetricReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
klunarytweets
EventKLDivergenceUnary1Tweets
(SillyMapper,
KLDivergenceUnaryReducer)
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
kldivergenceunary
EventCosineSimilarityTweets
(PMIMapper,
CosineSimilarityReducer)
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
cosinesimilarity
Extracted features
Others
STEP 3d: FEATURE EXTRACTION (TWEETS) PIPELINE
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
probability
unaryfeaturemerger1_qterms
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
unary1_qterms
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
conditionalprobability
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
jointprobability
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
jointuserprobability
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
conditionaluserprobability
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
entropy
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
pmi
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
kldivergenceunary
$VIS_GRID_HOME/
rankprepout/spark/
qterms/week{WNo}/
cosinesimilarity
STEP 4a: FEATURE MERGING (QUERY TERMS) PIPELINE
unaryfeaturemerger2_qterms
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
unary2_qterms
symmetricfeaturemerger_qterms
(MergeFeaturesMapper,
SymmetricFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
symmetric_qterms
asymmetricfeaturemerger_qterms
(MergeFeaturesMapper,
AsymmetricFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
asymmetric_qterms
reverseasymmetricfeaturemerger_qterms
(MergeFeaturesMapper,
AsymmetricFeatureMergerReducer)
$VIS_GRID_HOM
E/rankprepout/tmp/
statsmerge/
reverseasymetric_
qterms
Features
Others
$VIS_GRID_HOME/
rankinput
$VIS_GRID_HOME/
rankprepout/spark/
qsessions/
week{WNo}/
probability
unaryfeaturemerger1_qsessions
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
unary1_qsessions
$VIS_GRID_HOME/
rankprepout/spark/
qsessions/
week{WNo}/
conditionalprobability
$VIS_GRID_HOME/
rankprepout/spark/
qsessions/
week{WNo}/
jointprobability
$VIS_GRID_HOME/
rankprepout/spark/
qsessions/
week{WNo}/
jointuserprobability
$VIS_GRID_HOME/
rankprepout/spark/
qsessions/week{WNo}/
conditionaluserprobability
$VIS_GRID_HOME
/rankprepout/spark/
qsessions/
week{WNo}/entropy
$VIS_GRID_HOME/
rankprepout/spark/
qsessions/
week{WNo}/pmi
$VIS_GRID_HOME/
rankprepout/spark/
qsessions/
week{WNo}/
kldivergenceunary
$VIS_GRID_HOME/
rankprepout/spark/
qsessions/week{WNo}/
cosinesimilarity
STEP 4b: FEATURE MERGING (QUERY SESSIONS) PIPELINE
unaryfeaturemerger2_qsessions
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
unary2_qsessions
symmetricfeaturemerger_qsessions
(MergeFeaturesMapper,
SymmetricFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
symmetric_qsessions
asymmetricfeaturemerger_qsessions
(MergeFeaturesMapper,
AsymmetricFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
asymmetric_qsessions
reverseasymmetricfeaturemerger_qsessions
(MergeFeaturesMapper,
AsymmetricFeatureMergerReducer)
$VIS_GRID_HOM
E/rankprepout/tmp/
statsmerge/
reverseasymetric_
qsessions
Features
Others
$VIS_GRID_HOME/
rankinput
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/
probability
unaryfeaturemerger1_ ifl ckr
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
unary1_ ifl ckr
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/
conditionalprobability
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/
jointprobability
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/
jointuserprobability
$VIS_GRID_HOME/
rankprepout/spark/ ifl ckr/
week{WNo}/
conditionaluserprobability
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/
entropy
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/pmi
$VIS_GRID_HOME/
rankprepout/spark/
ifl ckr/week{WNo}/
kldivergenceunary
$VIS_GRID_HOME/
rankprepout/spark/ ifl ckr/
week{WNo}/
cosinesimilarity
STEP 4c: FEATURE MERGING (FLICKR TAGS) PIPELINE
unaryfeaturemerger2_ ifl ckr
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
unary2_ ifl ckr
symmetricfeaturemerger_ ifl ckr
(MergeFeaturesMapper,
SymmetricFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
symmetric_ ifl ckr
asymmetricfeaturemerger_ ifl ckr
(MergeFeaturesMapper,
AsymmetricFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
asymmetric_ ifl ckr
reverseasymmetricfeaturemerger_ ifl ckr
(MergeFeaturesMapper,
AsymmetricFeatureMergerReducer)
$VIS_GRID_HOM
E/rankprepout/tmp/
statsmerge/
reverseasymetric_f
lickr
Features
Others
$VIS_GRID_HOME/
rankinput
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
probability
unaryfeaturemerger1_tweets
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
unary1_tweets
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
conditionalprobability
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
jointprobability
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
jointuserprobability
$VIS_GRID_HOME/
rankprepout/spark/tweets/
week{WNo}/
conditionaluserprobability
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
entropy
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
pmi
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
kldivergenceunary
$VIS_GRID_HOME/
rankprepout/spark/
tweets/week{WNo}/
cosinesimilarity
STEP 4d: FEATURE MERGING (TWEETS) PIPELINE
unaryfeaturemerger2_tweets
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
unary2_tweets
symmetricfeaturemerger_tweets
(MergeFeaturesMapper,
SymmetricFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
symmetric_tweets
asymmetricfeaturemerger_tweets
(MergeFeaturesMapper,
AsymmetricFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
asymmetric_tweets
reverseasymmetricfeaturemerger_tweets
(MergeFeaturesMapper,
AsymmetricFeatureMergerReducer)
$VIS_GRID_HOM
E/rankprepout/tmp/
statsmerge/
reverseasymetric_t
weets
Features
Others
$VIS_GRID_HOME/
rankinput
STEP 5a: FEATURE EXTRACTION AND MERGING (COMBINED FEATURES) PIPELINE
combinedfeaturem
erger_qsessions
(CombinedFeature
MergerMapper)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
reverseasymetric_qsessions
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
combined_qsessions
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
reverseasymetric_qterms
combinedfeaturem
erger_qterms
(CombinedFeature
MergerMapper)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
combined_qterms
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
reverseasymetric_ ifl ckr
combinedfeaturem
erger_ ifl ckr
(CombinedFeature
MergerMapper)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
combined_ ifl ckr
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
reverseasymetric_tweets
combinedfeaturem
erger_tweets
(CombinedFeature
MergerMapper)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
combined_tweets
joinfeatures
(JoinerMapper,
JoinerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
joinfeatures3
STEP 5b: FEATURE EXTRACTION AND MERGING (GRAPH) PIPELINE
Features
Others
$VIS_GRID_HOME
/rankprepout/tmp/
statsmerge/
joinfeatures3
$VIS_GRID_HOME/
rankinput
graph_sharedconnect_1
(GRSharedConnectMapper,
GRSharedConnectReducer)
$VIS_GRID_HOME/
rankprepout/graph/
sharedconnect
graph_sharedconnect_2
(NullValueSillyMapper,
CounterReducer)
$VIS_GRID_HOME/
rankprepout/graph/
sharedconnect_2
join_graph
(MergeFeaturesMapper,
SymmetricFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
joinfeature_g_1
graph_popularity_rank_all
(GRPopularityRankMapper,
GRPopularityRankReducer)
$VIS_GRID_HOME/
rankprepout/graph/
popularity_rank_all
graph_popularity_rank_directed
(GRPopularityRankMapper,
GRPopularityRankReducer)
$VIS_GRID_HOME/
rankprepout/graph/
popularity_rank_directed
unaryfeaturemerger_entpopmov
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
joinfeature_pop1
unaryfeaturemerger_entpopmov2
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME
/rankprepout/tmp/
statsmerge/
joinfeature_pop2
unaryfeaturemerger_entpopmov3
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
joinfeature_pop3
unaryfeaturemerger_entpopmov4
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
joinfeature_pop4
STEP 5c: FEATURE MERGING (POPULARITY) PIPELINE
Features
Others
$VIS_GRID_HOME/
rankinput
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
joinfeature_pop4
WebCitationTotalHits
Normalization
(SillyMapper)
$VIS_GRID_HOME/
web_citation/
total_hits
$VIS_GRID_HOME/
rankprepout/tmp/
webcitation_totalhits
WebCitationDeepHits
Normalization
(SillyMapper)
$VIS_GRID_HOME/
web_citation/
deep_hits
$VIS_GRID_HOME/
rankprepout/tmp/
webcitation_deephits
asymmetricfeaturemerger_webcitation
(MergeFeaturesMapper,
AsymmetricFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
asymmetric_webcitation
coverage
(QueryCountMapper,
QueryCountReducer)
$VIS_GRID_HOME/
rankprepout/spark/
coverage/week{WNo}
$VIS_GRID_HOME/
rankprepout/spark/
logs/week{WNo}
joincov1
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
joinfeaturescov1
joincov2
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
joinfeaturescov2
joinwikipop1
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
/user/barla/Spark/
wikiResultCounts
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
joinfeatureswikipop1
joinwikipop2
(MergeFeaturesMapper,
UnaryFeatureMergerReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/
joinfeatureswikipop2
jointypes
(EntityRelationTypeMapper,
EntityRelationTypeReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/jointypes
STEP 6: RANKING PIPELINE
MLR scoring
(/homes/barla/gf_mlr/
mlr_rank)
$VIS_GRID_HOME/
rankprepout/tmp/
statsmerge/jointypes
/homes/barla/gf_mlr/
gbrank.xml
/homes/barla/gf_mlr/
header.tsv
$VIS_GRID_HOME/
rankprepout/spark/
ranking
MLR scoring
(SillyMapper,
DisambiguationReducer)
$VIS_GRID_HOME/
rankprepout/spark/
ranking-disambiguated
groupmax
(SillyMapper,
OutputMaxReducer)
$VIS_GRID_HOME/
rankprepout/tmp/
ranking
formatranking
(RankingFormatterMapper,
RankingFormatterReducer)
$VIS_GRID_HOME/
rankprepout/spark/
ranking nfi al
STEP 7: DATAPACK GENERATION PIPELINE
$VIS_GRID_HOME/
rankprepout/spark/
ranking nfi al
mergerank
(GFFeedMapper,
GFFeedReducer)
$VIS_GRID_HOME/
gfrankout
$VIS_GRID_HOME/
gfrankout/1/yalinda/
phyfacet/ranking/spark
$VIS_GRID_HOME/
gfrankout/2/yalinda/
phyfacet/ranking/spark
$VIS_GRID_HOME/
gfrankout/3/yalinda/
phyfacet/ranking/spark
$VIS_GRID_HOME/
gfrankout/4/yalinda/
phyfacet/ranking/spark
$VIS_GRID_HOME/
gfrankout/5/yalinda/
phyfacet/ranking/spark
$VIS_GRID_HOME/
gfrankout10/geo/
phyfacet/ranking/spark
$VIS_GRID_HOME/
gfstorageall/1/yalinda/
phyfacet/ranking/spark
$VIS_GRID_HOME/
gfstorageall/2/yalinda/
phyfacet/ranking/spark
$VIS_GRID_HOME/
gfstorageall/3/yalinda/
phyfacet/ranking/spark
$VIS_GRID_HOME/
gfstorageall/4/yalinda/
phyfacet/ranking/spark
$VIS_GRID_HOME/
gfstorageall/5/yalinda/
phyfacet/ranking/spark
$VIS_GRID_HOME/
gfstorageall/10/geo/
phyfacet/ranking/spark
$VIS_GRID_HOME/
gfstorageall/1/yalinda/
phyfacet/data
$VIS_GRID_HOME/
gfstorageall/2/yalinda/
phyfacet/data
$VIS_GRID_HOME/
gfstorageall/3/yalinda/
phyfacet/data
$VIS_GRID_HOME/
gfstorageall/4/yalinda/
phyfacet/data
$VIS_GRID_HOME/
gfstorageall/5/yalinda/
phyfacet/data
$VIS_GRID_HOME/
gfstorageall/10/geo/
phyfacet/data
$VIS_GRID_HOME/
gfstorageall/
logicalobj/data
dumper1
(DumpFirstMapper,
DumpFirstReducer)
$VIS_GRID_HOME/
platdumpTmp
dumper2
(DumpSecondMapper,
DumpSecondReducer)
$VIS_GRID_HOME/
platdump
datapack
(Mapper,
YISFacetMergeReducer)
$VIS_GRID_HOME/
datapack
- 41 -
High-Level Architecture View
Entity
graph
Data
preprocessing
Feature
extraction
Model
learning
Feature
sources
Editorial
judgements
Datapack
Ranking
model
Ranking and
disambiguation
Entity
data
Features
- 42 -
Spark Architecture
Entity
graph
Data
preprocessing
Feature
extraction
Model
learning
Feature
sources
Editorial
judgements
Datapack
Ranking
model
Ranking and
disambiguation
Entity
data
Features
- 43 -
Entity
graph
Data
preprocessing
Feature
extraction
Model
learning
Feature
sources
Editorial
judgements
Datapack
Ranking
model
Ranking and
disambiguation
Entity
data
Features
Data Preprocessing
- 44 -
Entity graph
• 3.4 million entities, 160 million relations
• Locations: Internet Locality, Wikipedia, Yahoo! Travel
• Athletes, teams: Yahoo! Sports
• People, characters, movies, TV shows, albums: Dbpedia
• Example entities
• Dbpedia Brad_Pitt Brad Pitt Movie_Actor
• Dbpedia Brad_Pitt Brad Pitt Movie_Producer
• Dbpedia Brad_Pitt Brad Pitt Person
• Dbpedia Brad_Pitt Brad Pitt TV_Actor
• Dbpedia Brad_Pitt_(boxer) Brad Pitt Person
• Example relations
• Dbpedia Dbpedia Brad_Pitt Angelina_Jolie Person_IsPartnerOf_Person
• Dbpedia Dbpedia Brad_Pitt Angelina_Jolie MovieActor_CoCastsWith_MovieActor
• Dbpedia Dbpedia Brad_Pitt Angelina_Jolie MovieProducer_ProducesMovieCastedBy_MovieActor
- 45 -
Entity graph challenges
• Coverage of the query volume
– New entities and entity types
– Additional inference
– International data
– Aliases, e.g. jlo, big apple, thomas cruise mapother iv
• Freshness
– People query for a movie long before it’s released
• Irrelevant entity and relation types
– E.g. voice actors who co-acted in a movie, cities in a continent
• Data quality
– United States Senate career of Barack Obama is not a person
– Andy Lau has never acted in Iron Man 3
- 46 -
Entity
graph
Data
preprocessing
Feature
extraction
Model
learning
Feature
sources
Editorial
judgements
Datapack
Ranking
model
Ranking and
disambiguation
Entity
data
Features
Feature extraction
- 47 -
Feature extraction from text
• Text sources
– Query terms
– Query sessions
– Flickr tags
– Tweets
• Common representation
Input tweet:
Brad Pitt married to Angelina Jolie in Las Vegas
Output event:
Brad Pitt + Angelina Jolie
Brad Pitt + Las Vegas
Angelina Jolie + Las Vegas
- 48 -
Features
• Unary
– Popularity features from text: probability, entropy, wiki id
popularity …
– Graph features: PageRank on the entity graph, wikipedia, web
graph
– Type features: entity type
• Binary
– Co-occurrence features from text: conditional probability, joint
probability …
– Graph features: common neighbors …
– Type features: relation type
- 49 -
Feature extraction challenges
• Efficiency of text tagging
– Hadoop Map/Reduce
• More features are not always better
– Can lead to over-fitting without sufficient training data
- 50 -
Entity
graph
Data
preprocessing
Feature
extraction
Model
learning
Feature
sources
Editorial
judgements
Datapack
Ranking
model
Ranking and
disambiguation
Entity
data
Features
Model Learning
- 51 -
Model Learning
• Training data created by editors (five grades)
400 Brandi adriana lima Brad Pitt person Embarassing
1397 David H. andy garcia Brad Pitt person Mostly Related
3037 Jennifer benicio del toro Brad Pitt person Somewhat Related
4615 Sarah burn after reading Brad Pitt person Excellent
9853 Jennifer fight club movie Brad Pitt person Perfect
• Join between the editorial data and the feature file
• Trained a regression model using GBDT
–Gradient Boosted Decision Trees
• 10-fold cross validation optimizing NDCG and tuning
•number of trees
•number of nodes per tree
- 52 -
Feature importance
RANK FEATURE IMPORTANCE
1 Relation type 100
2 PageRank (Related entity) 99.6075
3 Entropy – Flickr 94.7832
4 Probability – Flickr 82.6172
5 Probability – Query terms 78.9377
6 Shared connections 68.296
7 Cond. Probability – Flickr 68.0496
8 PageRank (Entity) 57.6078
9 KL divergence – Flickr 55.4604
10 KL divergence – Query terms 55.0662
- 53 -
Impact of training data
Number of training instances (judged relations)
- 54 -
Performance by query-entity type
•High overall performance but some types are more difficult
•Locations
– Editors downgrade popular entities such as businesses
NDCG by type of the query entity
- 55 -
Model Learning challenges
• Editorial preferences not necessarily coincide with usage
– Users click a lot more on people than expected
– Image bias?
• Alternative: optimize for usage data
– Clicks turned into labels or preferences
– Size of the data is not a concern
– Gains are computed from normalized CTR/COEC
– See van Zwol et al. Ranking Entity Facets Based on User Click
Feedback. ICSC 2010: 192-199.
couple of hundred entities and their facets we find that
linear combination of the conditional probabilities gives
t performance on the collected judgements using wqt = 2,
= 0.5, and wf t = 1. However, the editorial data was not
stantial enough to learn a ranking with GBDT.
Click-through Rate versus Click over Expected Click
From the image search query logs, we collect the user click
a that is related to the facets. This allows us to compute the
ck-through rate (CTR) on a facet for a given entity that is
ected in a user query and for which the facets were shown
he user. Let clickse,f be the number of clicks on a facet
ty f show in relation to entity e, and viewse,f the number
times the facet f is shown to a user for a related entity e,
n the probability of a click on a facet entity f for a given
ty e can be modelled as ctre,f :
ctre,f =
clickse,f
viewse,f
(2)
n Figure 3 the conditional click-through rate is shown for
first ten positions. It shows the CTR per position for every
ge view where one of the facets is clicked, aggregated over
coece,f =
cl
PP
p=1 vi
Zhang and Jones [3] refer to
expected clicks, based on the de
expected clicks given the positio
C. Gradient Boosted Decision Tr
Stochastic gradient boosted dec
the most widely used learning alg
today. Gradient tree boosting co
sion model, utilizing decision tr
One advantage over other learn
trees in general is that the feat
are highly interpretable. GBDT
different loss functions can be u
research presented here we used
our loss function. In related work,
pairwise and ranking specific lo
well at improving search relevanc
shallow decision trees, trees in s
on a randomly selected subset of
prone to over-fitting [14]. For the
shown in the search engine
he ground truth for creating
set used by the gradient
onal Probabilities
of the facets search expe-
unction rank(e, f) that is
onal probabilities extracted
⇥Pqs(f|e)+wf t ⇥Pf t (f, e)
(1)
e) are the conditional prob-
he weights for the different
qt), query session (qs) and
l judgements collected for
their facets we find that
ditional probabilities gives
udgements using wqt = 2,
the editorial data was not
ng with GBDT.
k over Expected Click
gs, we collect the user click
all entities. Observe that the CTR declines when the position
at which a facet is shown increases.
We introduce a second click model, based on the notion
of clicks over expected clicks (COEC). To allows us to deal
with the so called position bias – where facets appearing in
lower positions are less likely to be clicked even if they are
relevant [2]. This phenomenon isoften observed in Web search
and we adopt the COEC model proposed by Chapelle and
Zhang [11]. In that model, we estimate ctrp as the aggregated
ctr – over all queries and sessions – in position p for all
positions P. Let then clickse,f be the number of clicks on
a facet entity f show in relation to entity e, and viewse,f p
the
number of times the facet f is shown to a user for a related
entity e at position p. The probability of a click over expected
click on a facet entity f for a given entity e can then be
modelled as coece,f :
coece,f =
clickse,f
PP
p=1 viewse,f p ⇥ ctrp
(3)
Zhang and Jones [3] refer to this method as clicks over
expected clicks, based on the denominator that includes the
expected clicks given the positions that the url appeared in.
C. Gradient Boosted Decision Trees
Stochastic gradient boosted decision trees (GBDT) is one of
- 56 -
Entity
graph
Data
preprocessing
Feature
extraction
Model
learning
Feature
sources
Editorial
judgements
Datapack
Ranking
model
Ranking and
disambiguation
Entity
data
Features
Ranking and Disambiguation
- 57 -
Ranking and Disambiguation
• We apply the ranking function offline to the data
• Disambiguation
– How many times a given wiki id was retrieved for queries containing the entity
name?
Brad Pitt Brad_Pitt 21158
Brad Pitt Brad_Pitt_(boxer) 247
XXX XXX_(movie) 1775
XXX XXX_(Asia_album) 89
XXX XXX_(ZZ_Top_album) 87
XXX XXX_(Danny_Brown_album) 67
– PageRank for disambiguating locations (wiki ids are not available)
• Expansion to query patterns
– Entity name + context, e.g. brad pitt actor
- 58 -
Ranking and Disambiguation challenges
• Disambiguation cases that are too close to call
– Fargo Fargo_(film) 3969
– Fargo Fargo,_North_Dakota 4578
• Disambiguation across Wikipedia and other sources
- 59 -
Evaluation #2: Side-by-side testing
• Comparing two systems
– A/B comparison, e.g. current system under development and production
system
– Scale: A is better, B is better
• Separate tests for relevance and image quality
– Image quality can significantly influence user perceptions
– Images can violate safe search rules
• Classification of errors
– Results: missing important results/contains irrelevant results, too few results,
entities are not fresh, more/less diverse, should not have triggered
– Images: bad photo choice, blurry, group shots, nude/racy etc.
• Notes
– Borderline, set one entities relate to the movie Psy but the query is most likely
about Gangnam style
– Blondie and Mickey Gilley are 70’s performers and do not belong on a list of
60’s musicians.
– There is absolutely no relation between Finland and California.
- 60 -
Evaluation #3: Bucket testing
• Also called online evaluation
– Comparing against baseline version of the system
– Baseline does not change during the test
• Small % of search traffic redirected to test system, another
small % to the baseline system
• Data collection over at least a week, looking for stat.
significant differences that are also stable over time
• Metrics in web search
– Coverage and Click-through Rate (CTR)
– Searches per browser-cookie (SPBC)
– Other key metrics should not impacted negatively, e.g.
Abandonment and retry rate, Daily Active Users (DAU),
Revenue Per Search (RPS), etc.
- 61 -
Coverage before and after the new system
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
Days
Coverage
Coverage before Spark
Trend before Spark
Coverage after Spark
Trend after Spark
Spark is deployed
in production
Before release:
Flat, lower
After release:
Flat, higher
- 62 -
Click-through rate (CTR) before and after the new system
Before release:
Gradually
degrading performance
due to lack of fresh data
After release:
Learning effect:
users are starting to
use the tool again
0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80
Days
CTR
CTR before Spark
Trend before Spark
CTR after Spark
Trend after Spark
Spark is deployed
in production
- 63 -
Summary
• Spark
– System for related entity recommendations
• Knowledge base
• Extraction of signals from query logs and other user-generated
content
• Machine learned ranking
• Evaluation
• Other applications
– Recommendations on topic-entity pages
- 64 -
Future work
• New query types
– Queries with multiple entities
• adele skyfall
– Question-answering on keyword queries
• brad pitt movies
• brad pitt movies 2010
• Extending coverage
– Spark now live in CA, UK, AU, NZ, TW, HK, ES
• Even fresher data
– Stream processing of query log data
• Data quality improvements
• Online ranking with post-retrieval features
- 65 -
The End
• Many thanks to
– Barla Cambazoglu and Roi Blanco (Barcelona)
– Nicolas Torzec (US)
– Libby Lin (Product Manager, US)
– Search engineering (Taiwan)
• Contact
– pmika@yahoo-inc.com
– @pmika
• Internships available for PhD students

Mais conteúdo relacionado

Mais procurados

Making things findable
Making things findableMaking things findable
Making things findablePeter Mika
 
Semantic Search on the Rise
Semantic Search on the RiseSemantic Search on the Rise
Semantic Search on the RisePeter Mika
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic SearchPaul Wlodarczyk
 
Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Thanh Tran
 
Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and futureRoi Blanco
 
Semantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsSemantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsPeter Mika
 
Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015Grace Hui Yang
 
What happened to the Semantic Web?
What happened to the Semantic Web?What happened to the Semantic Web?
What happened to the Semantic Web?Peter Mika
 
An Introduction to Information Retrieval and Applications
 An Introduction to Information Retrieval and Applications An Introduction to Information Retrieval and Applications
An Introduction to Information Retrieval and Applications sathish sak
 
Knowledge Integration in Practice
Knowledge Integration in PracticeKnowledge Integration in Practice
Knowledge Integration in PracticePeter Mika
 
Georgios Meditskos and Stamatia Dasiopoulou | Question Answering over Pattern...
Georgios Meditskos and Stamatia Dasiopoulou | Question Answering over Pattern...Georgios Meditskos and Stamatia Dasiopoulou | Question Answering over Pattern...
Georgios Meditskos and Stamatia Dasiopoulou | Question Answering over Pattern...semanticsconference
 
Metadata Training for Staff and Librarians for the New Data Environment
Metadata Training for Staff and Librarians for the New Data EnvironmentMetadata Training for Staff and Librarians for the New Data Environment
Metadata Training for Staff and Librarians for the New Data EnvironmentDiane Hillmann
 
Challenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genChallenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genrobin fay
 
Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Amit Sheth
 

Mais procurados (20)

Making things findable
Making things findableMaking things findable
Making things findable
 
Semantic Search on the Rise
Semantic Search on the RiseSemantic Search on the Rise
Semantic Search on the Rise
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic Search
 
NISO/DCMI Webinar: Cooperative Authority Control: The Virtual International A...
NISO/DCMI Webinar: Cooperative Authority Control: The Virtual International A...NISO/DCMI Webinar: Cooperative Authority Control: The Virtual International A...
NISO/DCMI Webinar: Cooperative Authority Control: The Virtual International A...
 
Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012
 
Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and future
 
Semantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistantsSemantic search: from document retrieval to virtual assistants
Semantic search: from document retrieval to virtual assistants
 
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti... NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
NISO/DCMI May 22 Webinar: Semantic Mashups Across Large, Heterogeneous Insti...
 
Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015Text REtrieval Conference (TREC) Dynamic Domain Track 2015
Text REtrieval Conference (TREC) Dynamic Domain Track 2015
 
What happened to the Semantic Web?
What happened to the Semantic Web?What happened to the Semantic Web?
What happened to the Semantic Web?
 
An Introduction to Information Retrieval and Applications
 An Introduction to Information Retrieval and Applications An Introduction to Information Retrieval and Applications
An Introduction to Information Retrieval and Applications
 
Knowledge Integration in Practice
Knowledge Integration in PracticeKnowledge Integration in Practice
Knowledge Integration in Practice
 
Georgios Meditskos and Stamatia Dasiopoulou | Question Answering over Pattern...
Georgios Meditskos and Stamatia Dasiopoulou | Question Answering over Pattern...Georgios Meditskos and Stamatia Dasiopoulou | Question Answering over Pattern...
Georgios Meditskos and Stamatia Dasiopoulou | Question Answering over Pattern...
 
McDanold-1-jun15
McDanold-1-jun15McDanold-1-jun15
McDanold-1-jun15
 
General Introduction for Semantic Web and Linked Open Data
General Introduction for Semantic Web and Linked Open DataGeneral Introduction for Semantic Web and Linked Open Data
General Introduction for Semantic Web and Linked Open Data
 
Metadata Training for Staff and Librarians for the New Data Environment
Metadata Training for Staff and Librarians for the New Data EnvironmentMetadata Training for Staff and Librarians for the New Data Environment
Metadata Training for Staff and Librarians for the New Data Environment
 
Challenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genChallenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services gen
 
semantic web & natural language
semantic web & natural languagesemantic web & natural language
semantic web & natural language
 
Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...Semantic Web in Action: Ontology-driven information search, integration and a...
Semantic Web in Action: Ontology-driven information search, integration and a...
 
Billey Just Because We Can Doesn't Mean We Should
Billey Just Because We Can Doesn't Mean We ShouldBilley Just Because We Can Doesn't Mean We Should
Billey Just Because We Can Doesn't Mean We Should
 

Semelhante a Semantic Search

Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalRoi Blanco
 
Related Entity Finding on the Web
Related Entity Finding on the WebRelated Entity Finding on the Web
Related Entity Finding on the WebPeter Mika
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialBarbara Starr
 
Питер Мика "Making the web searchable"
Питер Мика "Making the web searchable"Питер Мика "Making the web searchable"
Питер Мика "Making the web searchable"Yandex
 
Clarin nl odijk-final_event_2015-03-13
Clarin nl odijk-final_event_2015-03-13Clarin nl odijk-final_event_2015-03-13
Clarin nl odijk-final_event_2015-03-13CLARIAH
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologiesenterprisesearchmeetup
 
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web Thanh Tran
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notesAnandh Arumugakan
 
Search & Recommendation: Birds of a Feather?
Search & Recommendation: Birds of a Feather?Search & Recommendation: Birds of a Feather?
Search & Recommendation: Birds of a Feather?Toine Bogers
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalCarsten Eickhoff
 
Practical Information Architecture
Practical Information ArchitecturePractical Information Architecture
Practical Information ArchitectureRob Bogue
 
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...SEAD
 
Chalitha Perera | Cross Media Concept and Entity Driven Search for Enterprise
Chalitha Perera | Cross Media Concept and Entity Driven Search for EnterpriseChalitha Perera | Cross Media Concept and Entity Driven Search for Enterprise
Chalitha Perera | Cross Media Concept and Entity Driven Search for Enterprisesemanticsconference
 
cross media concept and entity driven search for enterprise
cross media concept and entity driven search for enterprisecross media concept and entity driven search for enterprise
cross media concept and entity driven search for enterpriseDileepa Jayakody
 
Managing your Metadata w/ SharePoint 2010
Managing your Metadata w/ SharePoint 2010Managing your Metadata w/ SharePoint 2010
Managing your Metadata w/ SharePoint 2010vman916
 
How Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisHow Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisCrowdFlower
 

Semelhante a Semantic Search (20)

Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Related Entity Finding on the Web
Related Entity Finding on the WebRelated Entity Finding on the Web
Related Entity Finding on the Web
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorial
 
Питер Мика "Making the web searchable"
Питер Мика "Making the web searchable"Питер Мика "Making the web searchable"
Питер Мика "Making the web searchable"
 
Clarin nl odijk-final_event_2015-03-13
Clarin nl odijk-final_event_2015-03-13Clarin nl odijk-final_event_2015-03-13
Clarin nl odijk-final_event_2015-03-13
 
Relevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search TechnologiesRelevancy and Search Quality Analysis - Search Technologies
Relevancy and Search Quality Analysis - Search Technologies
 
From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web From Expert Finding to Entity Search on the Web
From Expert Finding to Entity Search on the Web
 
NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...
NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...
NISO/DCMI September 25 Webinar: Implementing Linked Data in Developing Countr...
 
CS6007 information retrieval - 5 units notes
CS6007   information retrieval - 5 units notesCS6007   information retrieval - 5 units notes
CS6007 information retrieval - 5 units notes
 
Linked (Open) Data
Linked (Open) DataLinked (Open) Data
Linked (Open) Data
 
Search & Recommendation: Birds of a Feather?
Search & Recommendation: Birds of a Feather?Search & Recommendation: Birds of a Feather?
Search & Recommendation: Birds of a Feather?
 
Riley-o.com
Riley-o.comRiley-o.com
Riley-o.com
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 
Practical Information Architecture
Practical Information ArchitecturePractical Information Architecture
Practical Information Architecture
 
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...Data Sets, Ensemble Cloud Computing, and the University Library:Getting the ...
Data Sets, Ensemble Cloud Computing, and the University Library: Getting the ...
 
Chalitha Perera | Cross Media Concept and Entity Driven Search for Enterprise
Chalitha Perera | Cross Media Concept and Entity Driven Search for EnterpriseChalitha Perera | Cross Media Concept and Entity Driven Search for Enterprise
Chalitha Perera | Cross Media Concept and Entity Driven Search for Enterprise
 
cross media concept and entity driven search for enterprise
cross media concept and entity driven search for enterprisecross media concept and entity driven search for enterprise
cross media concept and entity driven search for enterprise
 
The Web of Data: The W3C Semantic Web Initiative
The Web of Data: The W3C Semantic Web InitiativeThe Web of Data: The W3C Semantic Web Initiative
The Web of Data: The W3C Semantic Web Initiative
 
Managing your Metadata w/ SharePoint 2010
Managing your Metadata w/ SharePoint 2010Managing your Metadata w/ SharePoint 2010
Managing your Metadata w/ SharePoint 2010
 
How Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment AnalysisHow Oracle Uses CrowdFlower For Sentiment Analysis
How Oracle Uses CrowdFlower For Sentiment Analysis
 

Mais de sssw2012

Collaborative ontology development
Collaborative ontology developmentCollaborative ontology development
Collaborative ontology developmentsssw2012
 
Manfred Linking the Real World
Manfred Linking the Real WorldManfred Linking the Real World
Manfred Linking the Real Worldsssw2012
 
The Web of Data - Tom Heath
The Web of Data - Tom HeathThe Web of Data - Tom Heath
The Web of Data - Tom Heathsssw2012
 
Linked Data Applications: There is No-One-Size-Fits-All Formula - Asun Gomez ...
Linked Data Applications: There is No-One-Size-Fits-All Formula - Asun Gomez ...Linked Data Applications: There is No-One-Size-Fits-All Formula - Asun Gomez ...
Linked Data Applications: There is No-One-Size-Fits-All Formula - Asun Gomez ...sssw2012
 
Valentina Presutti - Ontology Design Patterns: an introduction
Valentina Presutti - Ontology Design Patterns: an introductionValentina Presutti - Ontology Design Patterns: an introduction
Valentina Presutti - Ontology Design Patterns: an introductionsssw2012
 
Ivan Herman - Semantic Web Activities @ W3C
Ivan Herman - Semantic Web Activities @ W3CIvan Herman - Semantic Web Activities @ W3C
Ivan Herman - Semantic Web Activities @ W3Csssw2012
 
jerome Euzenat - Ontology Matching
jerome Euzenat - Ontology Matchingjerome Euzenat - Ontology Matching
jerome Euzenat - Ontology Matchingsssw2012
 
Aldo Gangemi - Meaning on the Web: An Empirical Design Perspective
Aldo Gangemi - Meaning on the Web: An Empirical Design PerspectiveAldo Gangemi - Meaning on the Web: An Empirical Design Perspective
Aldo Gangemi - Meaning on the Web: An Empirical Design Perspectivesssw2012
 

Mais de sssw2012 (8)

Collaborative ontology development
Collaborative ontology developmentCollaborative ontology development
Collaborative ontology development
 
Manfred Linking the Real World
Manfred Linking the Real WorldManfred Linking the Real World
Manfred Linking the Real World
 
The Web of Data - Tom Heath
The Web of Data - Tom HeathThe Web of Data - Tom Heath
The Web of Data - Tom Heath
 
Linked Data Applications: There is No-One-Size-Fits-All Formula - Asun Gomez ...
Linked Data Applications: There is No-One-Size-Fits-All Formula - Asun Gomez ...Linked Data Applications: There is No-One-Size-Fits-All Formula - Asun Gomez ...
Linked Data Applications: There is No-One-Size-Fits-All Formula - Asun Gomez ...
 
Valentina Presutti - Ontology Design Patterns: an introduction
Valentina Presutti - Ontology Design Patterns: an introductionValentina Presutti - Ontology Design Patterns: an introduction
Valentina Presutti - Ontology Design Patterns: an introduction
 
Ivan Herman - Semantic Web Activities @ W3C
Ivan Herman - Semantic Web Activities @ W3CIvan Herman - Semantic Web Activities @ W3C
Ivan Herman - Semantic Web Activities @ W3C
 
jerome Euzenat - Ontology Matching
jerome Euzenat - Ontology Matchingjerome Euzenat - Ontology Matching
jerome Euzenat - Ontology Matching
 
Aldo Gangemi - Meaning on the Web: An Empirical Design Perspective
Aldo Gangemi - Meaning on the Web: An Empirical Design PerspectiveAldo Gangemi - Meaning on the Web: An Empirical Design Perspective
Aldo Gangemi - Meaning on the Web: An Empirical Design Perspective
 

Último

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 

Último (20)

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 

Semantic Search

  • 1. Related Entity Finding on the Web Peter Mika Senior Research Scientist Yahoo! Research Joint work with B. Barla Cambazoglu and Roi Blanco
  • 2. - 2 - Yahoo! serves over 700 million users in 25 countries
  • 3. - 3 - Yahoo! Research: visit us at research.yahoo.com
  • 4. - 4 - Yahoo! Research Barcelona • Established January, 2006 • Led by Ricardo Baeza-Yates • Research areas – Web Mining • content, structure, usage – Social Media – Distributed Systems – Semantic Search
  • 5. - 5 - Search is really fast, without necessarily being intelligent
  • 6. - 6 - Why Semantic Search? Part I • Improvements in IR are harder and harder to come by – Machine learning using hundreds of features • Text-based features for matching • Graph-based features provide authority – Heavy investment in computational power, e.g. real-time indexing and instant search • Remaining challenges are not computational, but in modeling user cognition – Need a deeper understanding of the query, the content and/or the world at large – Could Watson explain why the answer is Toronto?
  • 7. - 7 - Poorly solved information needs • Ambiguous searches – Multiple interpretations • jaguar • paris hilton – Secondary meaning • george bush (and I mean the beer brewer in Arizona) – Subjectivity • reliable digital camera • paris hilton sexy – Imprecise or overly precise searches • jim hendler
  • 8. - 8 - Poorly solved information needs • Complex queries – Missing information • brad pitt zombie • florida man with 115 guns • peter mika barcelona • 34 year old computer scientist living in barcelona – Category queries • countries in africa • barcelona nightlife – Complex needs • 120 dollars in euros • digital camera under 300 dollars • world temperature in 2020 Many of these queries would not be asked by users, who learned over time what search technology can and can not do. Many of these queries would not be asked by users, who learned over time what search technology can and can not do.
  • 10. - 10 - Why Semantic Search? II. • The Semantic Web is now a reality – Standardization and development • Standards from the W3C • Tools and services from a variety of vendors – Data • Large amounts of public data published in RDF • Investment in private graphs • Commercial interest – Bing, Google, Yahoo, Yandex, Facebook – Need to satisfy end users • Not trained in logic • Can’t learn complex query languages such as SPARQL • May not know the data • May not be experts in the domain
  • 11. - 11 - This is not a business model http://en.wikipedia.org/wiki/Underpants_Gnomes
  • 12. - 12 - Entity-based Disambiguation
  • 13. - 13 - Entity Search
  • 14. - 14 - Entity-based Navigation / Exploration
  • 15. - 15 - Factual Search
  • 16. - 16 - Facebook Graph Search
  • 17. - 17 - Facebook Graph Search
  • 18. - 18 - Aggregation and computation
  • 19. - 19 - Contextual (pervasive, ambient) search Yahoo! Connected TV: Widget engine embedded into the TV Yahoo! Connected TV: Widget engine embedded into the TV Yahoo! IntoNow: recognize audio and show related content Yahoo! IntoNow: recognize audio and show related content
  • 20. - 20 - Interactive Voice Search • Apple’s Siri – Question-Answering • Variety of backend sources including Wolfram Alpha and various Yahoo! services – Task completion • E.g. schedule an event • Google Now
  • 21. - 21 - Conversational Search • Google’s Interactive Voice Search
  • 22. - 22 - Parlance project • Interactive, conversational voice search – Complex dialogs around a set of objects • Restaurant – Area – Price range – Type of cuisine – Complete system (mixed licence) • Automated Speech Recognition (ASR) • Spoken Language Understanding (SLU) • Interaction Management • Knowledge Base • Natural Language Generation (NLG) • Text-to-Speech (TTS) – Video • Commercial alternatives from Nuance
  • 23. What is Semantic Search?
  • 24. - 24 - What it’s like to be a machine? Roi Blanco
  • 25. - 25 - What it’s like to be a machine? ✜Θ♬♬ţğ ✜Θ♬♬ţğ√∞®ÇĤĪ✜★♬☐✓✓ ţğ★✜ ✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫Γ ≠=⅚©§★✓♪ΒΓΕ℠ ✖Γ ±♫⅜ ⏎↵⏏☐ģğğğμλκσςτ ⏎⌥°¶§ΥΦΦΦ✗✕☐
  • 26. - 26 - Semantic Search: a definition • Semantic search is a retrieval paradigm where – User intent and resources are represented using semantic models • Not just symbolic representations – Semantic models are exploited in the matching and ranking of resources • Often a hybrid of document and data retrieval – Documents with metadata • Metadata may be embedded inside the document • I’m looking for documents that mention countries in Africa. – Data retrieval • Structured data, but searchable text fields • I’m looking for directors, who have directed movies where the synopsis mentions dinosaurs. • Wide range of semantic search systems – Employ different semantic models, possibly at different steps of the search process and in order to support different tasks
  • 27. - 27 - Note about explicit vs. implicit semantics • Semantic Search methods focus on explicit semantics – Elements of text or data are mapped to a substantially different representation according to some theory of language or some theory of the world • As opposed to “bottom-up” or implicit semantics – e.g. Latent Semantic Analysis (LSA)
  • 28. - 28 - Semantic models • Semantics is concerned with the meaning of the resources made available for search – Various representations of meaning • Linguistic models: models of relationships among words – Taxonomies, thesauri, dictionaries of entity names – Natural language structures extracted from text, e.g. using dependency parsing – Inference along linguistic relations, e.g. broader/narrower terms, textual entailment • Conceptual models: models of relationships among objects – Ontologies capture entities in the world and their relationships – Words and phrases in text or records in a database are identified as representations of ontological elements – Inference along ontological relations, e.g. logical entailment
  • 29. - 29 - Semantic Search – a process view Document Representation Knowledge Representation Semantic Models Resources Documents
  • 30. - 30 - Semantic Search as a research field • Intersection of Semantic Web, Information Retrieval, Natural Language Processing and Databases • Emerging field of research – Workshops • Exploiting Semantic Annotations in Information Retrieval (2008-2012) at CIKM • Semantic Search (SemSearch) workshop series (2008-2011) at ESWC and WWW • Entity-oriented search workshop (2010-2011) at SIGIR • Joint Intl. Workshop on Semantic and Entity-oriented Search (2012) at SIGIR • Semantic Search Workshop (2011-2013) at VLDB – Special Issues of journals – Surveys • Related fields: – XML retrieval – Keyword search in databases – Natural Language Retrieval
  • 31. - 31 - Comparison to IR and DB • Information Retrieval (IR) support the retrieval of documents (document retrieval) – Representation based on lightweight syntax-centric models – Work well for topical search – Not so well for more complex information needs – Web scale • Database (DB) and Knowledge-based Systems (KB) deliver more precise answers (data retrieval) – More expressive models – Allow for complex queries – Retrieve concrete answers that precisely match queries – Not just matching and filtering, but also joins and more complex inference – Limitations in scalability
  • 32. - 32 - Search is required in the presence of ambiguity Query Data KeywordsKeywords NL Questions NL Questions Form- / facet- based Inputs Form- / facet- based Inputs Structured Queries (SPARQL) Structured Queries (SPARQL) OWL ontologies with rich, formal semantics OWL ontologies with rich, formal semantics Structured RDF data Structured RDF data Semi- Structured RDF data Semi- Structured RDF data RDF data embedded in text (RDFa) RDF data embedded in text (RDFa) Ambiguities: interpretation Ambiguities: interpretation, extraction errors, data quality, confidence/trust
  • 33. - 33 - list search related entity finding entity search SemSearch 2010/11 list completion SemSearch 2011 TREC ELC taskTREC REF-LOD task semantic search Common tasks in Semantic Search
  • 35. - 35 - Motivation • Some users are short on time – Need for direct answers – Query expansion, question-answering, information boxes, rich results… • Other users have time at their hand – Long term interests such as sports, celebrities, movies and music – Long running tasks such as travel planning
  • 36. - 36 - Example user sessions
  • 37. - 37 - Spark: related entity recommendations in web search • A search assistance tool for exploration • Recommend related entities given the user’s current query – Cf. Entity Search at SemSearch, TREC Entity Track • Ranking explicit relations in a Knowledge Base – Cf. TREC Related Entity Finding in LOD (REF-LOD) task • A previous version of the system live since 2010 • van Zwol et al.: Faceted exploration of image search results. WWW 2010: 961-970
  • 38. - 38 - Spark example I.
  • 39. - 39 - Spark example II.
  • 40. - 40 - How does it work? /user/torzecn/ Shared/Entity- Relationship_Graphs Yalinda feed parser (GFFeedMapper, GFFeedReducer) $VIS_GRID_HOME/ gfstorageall/{1,2,3}/ yalinda/phyfacet/data /projects/gridfaces/ feed/yahooomg/ Y! OMG feed parser (GFFeedMapper, GFFeedReducer) $VIS_GRID_HOME/ gfstorageall/6/yahooomg/ phyfacet/data /projects/gridfaces/ feed/geo Y! Geo feed parser (GFFeedMapper, GFFeedReducer) $VIS_GRID_HOME/ gfstorageall/10/geo/ phyfacet/data /projects/gridfaces/ feed/yahootv Y! TV feed parser (GFFeedMapper, GFFeedReducer) $VIS_GRID_HOME/ gfstorageall/5 /projects/gridfaces/ feed/editorialdata/ data/objects Editorial feed parser (GFFeedMapper, GFEditorialMergeReducer) $VIS_GRID_HOME/ gfstorageall/logicalobj/ dataArchive/ $VIS_GRID_HOME/ gfstorageall/ logicalfacet/data $VIS_GRID_HOME/ gfstorageall/logicalobj/ data /projects/gridfaces/ feed/editorialdata/ data/facets GFDumpMain-First (DumpFirstMapper, DumpFirstReducer) Empty/missing Activated Deactivated $VIS_GRID_HOME/ rankinput STEP 1: FEED PARSING PIPELINE GFDumpMain-Second (DumpSecondMapper, DumpSecondReducer) $VIS_GRID_HOME/ rankinputTmp CreateDictionary (DictionaryMapper, DistinctReducer) $VIS_GRID_HOME/ rankinput $VIS_GRID_HOME/ rankprepout/ dictionary $VIS_GRID_HOME/ rankprepout/spark/ logs/week{WNo} /data/SDS/data/ search_US CreateIntermediate LogFormat (NullValueSillyMapp er, DistinctReducer) CreateQuerySessions CommonModelFilter (FiltererMapper, FiltererReducer) $VIS_GRID_HOME/ rankprepout/tmp/ qsessions lfi terer CreateQuerySessions CommonModel (SillyMapper, QuerySessionsCreate SessionsReducer) $VIS_GRID_HOME/ rankprepout/tmp/ qsessions_cm CreateQueryTermsJoin DictionaryAndLogs (LeftJoinMapper, LeftJoinReducer) $VIS_GRID_HOME/ rankprepout/tmp/ qtermsjoindictionarylog CreateQueryTermsJoin QTermsAndLogs (SillyMapper, ImplodeReducer) $VIS_GRID_HOME/ rankprepout/tmp/ qterms_cm CreateFlickrIntermediate LogFormat (FlickrReformatMapper) /projects/gridfaces/ ifl ckr/feed $VIS_GRID_HOME/ rankprepout/tmp/ ifl ckrtagsintermediate CreateFlickrTagsCommon ModelFilter (FiltererMapper, FiltererReducer) $VIS_GRID_HOME/ rankprepout/tmp/ ifl ckrtags lfi terer CreateFlickrTagsCommon ModelFilter (SillyMapper, ImplodeReducer) $VIS_GRID_HOME/ rankprepout/tmp/ ifl ckr_cm General Query logs Flickr Twitter /projects/rtds/twitter/ refi hose CreateTwitterIntermediate LogFormat (NullValueSillyMapper, DistinctReducer) $VIS_GRID_HOME/ rankprepout/spark/ twitter_logs/week{WNo} CreateTweetsCommon ModelFilter (FiltererMapper, FiltererReducer) $VIS_GRID_HOME/ rankprepout/tmp/ tweets lfi terer CreateTweetsCommon Model (SillyMapper, ImplodeReducer) $VIS_GRID_HOME/ rankprepout/tmp/ tweets_cm DistinctUsersTwitter (SillyMapper, DistinctReducer) $VIS_GRID_HOME/ rankprepout/tmp/tweets/ distinctusers DistinctUsers ifl ckr (SillyMapper, DistinctReducer) $VIS_GRID_HOME/ rankprepout/tmp/ ifl ckr/ distinctusers DistinctUsersQSessions (SillyMapper, DistinctReducer) $VIS_GRID_HOME/ rankprepout/tmp/ qsessions/distinctusers DistinctUsersQTerms (SillyMapper, DistinctReducer) $VIS_GRID_HOME/ rankprepout/tmp/qterms/ distinctusers CountUsers (CounterMapper, CounterReducer) $VIS_GRID_HOME/ rankprepout/tmp/ distinctusers CountEvents (CounterMapper, CounterReducer) $VIS_GRID_HOME/ rankprepout/tmp/ countevents STEP 2: PREPROCESSING PIPELINE (before feature extraction and ranking) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ probability EventProbabilityQTerms (ProbabilityMapper, ProbabilityReducer) $VIS_GRID_HOME/ rankprepout/tmp/ qterms_cm $VIS_GRID_HOME/ rankprepout/tmp/ countevents EventConditionalProbabilityQterms (ConditionalProbabilityMapper, ConditionalProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ conditionalprobability EventJointProbabilityQterms (JointProbabilityMapper, ProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ jointprobability EventJointUserProbabilityQTerms (JointUserProbabilityMapper, JointUserProbabilityReducer) $VIS_GRID_HOME/ rankprepout/tmp/ countusers $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ jointuserprobability EventConditionalUserProb abilityQterms (ConditionalUserProbability PrepareMapper) $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserproba bility1/qterms ConditionalUserProbability 2_3-qt (ConditionalUserProbability PrepareMapper) $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserprobability2/ qterms/query $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserprobability2/ qterms/queryfacet EventConditionalUser Probability3_qterms (SillyMapper, ConditionalUserProba bilityReducer) $VIS_GRID_HOME/ rankprepout/spark/qterms/ week{WNo}/ conditionaluserprobability EventEntropyQTerms (EntropyMapper, EntropyReducer) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ entropy EventPMI1QTerms (SillyMapper, JoinUnaryMetricReducer) $VIS_GRID_HOME/ rankprepout/tmp/ pmiqterms EventPMIQTerms (PMIMapper, PMIReducer) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ pmi EventKLDivergenceUnary1QTerms (KLDivergenceUnaryJoinerMapper, JoinUnaryMetricReducer) $VIS_GRID_HOME/ rankprepout/tmp/ klunaryqterms EventKLDivergenceUnaryQTerms (SillyMapper, KLDivergenceUnaryReducer) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ kldivergenceunary EventCosineSimilarityQTerms (PMIMapper, CosineSimilarityReducer) $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ cosinesimilarity Extracted features Others STEP 3a: FEATURE EXTRACTION (QUERY TERMS) PIPELINE $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/ probability EventProbabilityFlickr (ProbabilityMapper, ProbabilityReducer) $VIS_GRID_HOME/ rankprepout/tmp/ ifl ckr_cm $VIS_GRID_HOME/ rankprepout/tmp/ countevents EventConditionalProbabilityFlickr (ConditionalProbabilityMapper, ConditionalProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/ conditionalprobability EventJointProbabilityFlickr (JointProbabilityMapper, ProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/ jointprobability EventJointUserProbabilityFlickr (JointUserProbabilityMapper, JointUserProbabilityReducer) $VIS_GRID_HOME/ rankprepout/tmp/ countusers $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/ jointuserprobability EventConditionalUserProb abilityFlickr (ConditionalUserProbability PrepareMapper) $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserproba bility1/ ifl ckr ConditionalUserProbability 2_3-fl (ConditionalUserProbability PrepareMapper) $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserprobability2/ ifl ckr/query $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserprobability2/ ifl ckr/queryfacet EventConditionalUser Probability3_ ifl ckr (SillyMapper, ConditionalUserProba bilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/ week{WNo}/ conditionaluserprobability EventEntropyFlickr (EntropyMapper, EntropyReducer) $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/ entropy EventPMI1Flickr (SillyMapper, JoinUnaryMetricReducer) $VIS_GRID_HOME/ rankprepout/tmp/ pmi ifl ckr EventPMIFlickr (PMIMapper, PMIReducer) $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/ pmi EventKLDivergenceUnary1Flickr (KLDivergenceUnaryJoinerMapper, JoinUnaryMetricReducer) $VIS_GRID_HOME/ rankprepout/tmp/ klunary ifl ckr EventKLDivergenceUnaryFlickr (SillyMapper, KLDivergenceUnaryReducer) $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/ kldivergenceunary EventCosineSimilarityFlickr (PMIMapper, CosineSimilarityReducer) $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/ week{WNo}/ cosinesimilarity Extracted features Others STEP 3c: FEATURE EXTRACTION (FLICKR TAGS) PIPELINE $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ probability EventProbabilityTweets (ProbabilityMapper, ProbabilityReducer) $VIS_GRID_HOME/ rankprepout/tmp/ tweets_cm $VIS_GRID_HOME/ rankprepout/tmp/ countevents EventConditionalProbabilityTweets (ConditionalProbabilityMapper, ConditionalProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ conditionalprobability EventJointProbabilityTweets (JointProbabilityMapper, ProbabilityReducer) $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ jointprobability EventJointUserProbabilityTweets (JointUserProbabilityMapper, JointUserProbabilityReducer) $VIS_GRID_HOME/ rankprepout/tmp/ countusers $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ jointuserprobability EventConditionalUserProb abilityTweets (ConditionalUserProbability PrepareMapper) $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserproba bility1/tweets ConditionalUserProbability 2_3-tw (ConditionalUserProbability PrepareMapper) $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserprobability2/ tweets/query $VIS_GRID_HOME/ rankprepout/tmp/ conditionaluserprobability2/ tweets/queryfacet EventConditionalUser Probability3_tweets (SillyMapper, ConditionalUserProba bilityReducer) $VIS_GRID_HOME/ rankprepout/spark/tweets/ week{WNo}/ conditionaluserprobability EventEntropyTweets (EntropyMapper, EntropyReducer) $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ entropy EventPMI1Tweets (SillyMapper, JoinUnaryMetricReducer) $VIS_GRID_HOME/ rankprepout/tmp/ pmitweets EventPMITweets (PMIMapper, PMIReducer) $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ pmi EventKLDivergenceUnary1Tweets (KLDivergenceUnaryJoinerMapper, JoinUnaryMetricReducer) $VIS_GRID_HOME/ rankprepout/tmp/ klunarytweets EventKLDivergenceUnary1Tweets (SillyMapper, KLDivergenceUnaryReducer) $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ kldivergenceunary EventCosineSimilarityTweets (PMIMapper, CosineSimilarityReducer) $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ cosinesimilarity Extracted features Others STEP 3d: FEATURE EXTRACTION (TWEETS) PIPELINE $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ probability unaryfeaturemerger1_qterms (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary1_qterms $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ conditionalprobability $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ jointprobability $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ jointuserprobability $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ conditionaluserprobability $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ entropy $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ pmi $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ kldivergenceunary $VIS_GRID_HOME/ rankprepout/spark/ qterms/week{WNo}/ cosinesimilarity STEP 4a: FEATURE MERGING (QUERY TERMS) PIPELINE unaryfeaturemerger2_qterms (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary2_qterms symmetricfeaturemerger_qterms (MergeFeaturesMapper, SymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ symmetric_qterms asymmetricfeaturemerger_qterms (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ asymmetric_qterms reverseasymmetricfeaturemerger_qterms (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOM E/rankprepout/tmp/ statsmerge/ reverseasymetric_ qterms Features Others $VIS_GRID_HOME/ rankinput $VIS_GRID_HOME/ rankprepout/spark/ qsessions/ week{WNo}/ probability unaryfeaturemerger1_qsessions (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary1_qsessions $VIS_GRID_HOME/ rankprepout/spark/ qsessions/ week{WNo}/ conditionalprobability $VIS_GRID_HOME/ rankprepout/spark/ qsessions/ week{WNo}/ jointprobability $VIS_GRID_HOME/ rankprepout/spark/ qsessions/ week{WNo}/ jointuserprobability $VIS_GRID_HOME/ rankprepout/spark/ qsessions/week{WNo}/ conditionaluserprobability $VIS_GRID_HOME /rankprepout/spark/ qsessions/ week{WNo}/entropy $VIS_GRID_HOME/ rankprepout/spark/ qsessions/ week{WNo}/pmi $VIS_GRID_HOME/ rankprepout/spark/ qsessions/ week{WNo}/ kldivergenceunary $VIS_GRID_HOME/ rankprepout/spark/ qsessions/week{WNo}/ cosinesimilarity STEP 4b: FEATURE MERGING (QUERY SESSIONS) PIPELINE unaryfeaturemerger2_qsessions (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary2_qsessions symmetricfeaturemerger_qsessions (MergeFeaturesMapper, SymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ symmetric_qsessions asymmetricfeaturemerger_qsessions (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ asymmetric_qsessions reverseasymmetricfeaturemerger_qsessions (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOM E/rankprepout/tmp/ statsmerge/ reverseasymetric_ qsessions Features Others $VIS_GRID_HOME/ rankinput $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/ probability unaryfeaturemerger1_ ifl ckr (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary1_ ifl ckr $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/ conditionalprobability $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/ jointprobability $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/ jointuserprobability $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/ week{WNo}/ conditionaluserprobability $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/ entropy $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/pmi $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/week{WNo}/ kldivergenceunary $VIS_GRID_HOME/ rankprepout/spark/ ifl ckr/ week{WNo}/ cosinesimilarity STEP 4c: FEATURE MERGING (FLICKR TAGS) PIPELINE unaryfeaturemerger2_ ifl ckr (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary2_ ifl ckr symmetricfeaturemerger_ ifl ckr (MergeFeaturesMapper, SymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ symmetric_ ifl ckr asymmetricfeaturemerger_ ifl ckr (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ asymmetric_ ifl ckr reverseasymmetricfeaturemerger_ ifl ckr (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOM E/rankprepout/tmp/ statsmerge/ reverseasymetric_f lickr Features Others $VIS_GRID_HOME/ rankinput $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ probability unaryfeaturemerger1_tweets (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary1_tweets $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ conditionalprobability $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ jointprobability $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ jointuserprobability $VIS_GRID_HOME/ rankprepout/spark/tweets/ week{WNo}/ conditionaluserprobability $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ entropy $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ pmi $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ kldivergenceunary $VIS_GRID_HOME/ rankprepout/spark/ tweets/week{WNo}/ cosinesimilarity STEP 4d: FEATURE MERGING (TWEETS) PIPELINE unaryfeaturemerger2_tweets (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ unary2_tweets symmetricfeaturemerger_tweets (MergeFeaturesMapper, SymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ symmetric_tweets asymmetricfeaturemerger_tweets (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ asymmetric_tweets reverseasymmetricfeaturemerger_tweets (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOM E/rankprepout/tmp/ statsmerge/ reverseasymetric_t weets Features Others $VIS_GRID_HOME/ rankinput STEP 5a: FEATURE EXTRACTION AND MERGING (COMBINED FEATURES) PIPELINE combinedfeaturem erger_qsessions (CombinedFeature MergerMapper) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ reverseasymetric_qsessions $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ combined_qsessions $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ reverseasymetric_qterms combinedfeaturem erger_qterms (CombinedFeature MergerMapper) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ combined_qterms $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ reverseasymetric_ ifl ckr combinedfeaturem erger_ ifl ckr (CombinedFeature MergerMapper) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ combined_ ifl ckr $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ reverseasymetric_tweets combinedfeaturem erger_tweets (CombinedFeature MergerMapper) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ combined_tweets joinfeatures (JoinerMapper, JoinerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeatures3 STEP 5b: FEATURE EXTRACTION AND MERGING (GRAPH) PIPELINE Features Others $VIS_GRID_HOME /rankprepout/tmp/ statsmerge/ joinfeatures3 $VIS_GRID_HOME/ rankinput graph_sharedconnect_1 (GRSharedConnectMapper, GRSharedConnectReducer) $VIS_GRID_HOME/ rankprepout/graph/ sharedconnect graph_sharedconnect_2 (NullValueSillyMapper, CounterReducer) $VIS_GRID_HOME/ rankprepout/graph/ sharedconnect_2 join_graph (MergeFeaturesMapper, SymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeature_g_1 graph_popularity_rank_all (GRPopularityRankMapper, GRPopularityRankReducer) $VIS_GRID_HOME/ rankprepout/graph/ popularity_rank_all graph_popularity_rank_directed (GRPopularityRankMapper, GRPopularityRankReducer) $VIS_GRID_HOME/ rankprepout/graph/ popularity_rank_directed unaryfeaturemerger_entpopmov (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeature_pop1 unaryfeaturemerger_entpopmov2 (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME /rankprepout/tmp/ statsmerge/ joinfeature_pop2 unaryfeaturemerger_entpopmov3 (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeature_pop3 unaryfeaturemerger_entpopmov4 (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeature_pop4 STEP 5c: FEATURE MERGING (POPULARITY) PIPELINE Features Others $VIS_GRID_HOME/ rankinput $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeature_pop4 WebCitationTotalHits Normalization (SillyMapper) $VIS_GRID_HOME/ web_citation/ total_hits $VIS_GRID_HOME/ rankprepout/tmp/ webcitation_totalhits WebCitationDeepHits Normalization (SillyMapper) $VIS_GRID_HOME/ web_citation/ deep_hits $VIS_GRID_HOME/ rankprepout/tmp/ webcitation_deephits asymmetricfeaturemerger_webcitation (MergeFeaturesMapper, AsymmetricFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ asymmetric_webcitation coverage (QueryCountMapper, QueryCountReducer) $VIS_GRID_HOME/ rankprepout/spark/ coverage/week{WNo} $VIS_GRID_HOME/ rankprepout/spark/ logs/week{WNo} joincov1 (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeaturescov1 joincov2 (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeaturescov2 joinwikipop1 (MergeFeaturesMapper, UnaryFeatureMergerReducer) /user/barla/Spark/ wikiResultCounts $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeatureswikipop1 joinwikipop2 (MergeFeaturesMapper, UnaryFeatureMergerReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/ joinfeatureswikipop2 jointypes (EntityRelationTypeMapper, EntityRelationTypeReducer) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/jointypes STEP 6: RANKING PIPELINE MLR scoring (/homes/barla/gf_mlr/ mlr_rank) $VIS_GRID_HOME/ rankprepout/tmp/ statsmerge/jointypes /homes/barla/gf_mlr/ gbrank.xml /homes/barla/gf_mlr/ header.tsv $VIS_GRID_HOME/ rankprepout/spark/ ranking MLR scoring (SillyMapper, DisambiguationReducer) $VIS_GRID_HOME/ rankprepout/spark/ ranking-disambiguated groupmax (SillyMapper, OutputMaxReducer) $VIS_GRID_HOME/ rankprepout/tmp/ ranking formatranking (RankingFormatterMapper, RankingFormatterReducer) $VIS_GRID_HOME/ rankprepout/spark/ ranking nfi al STEP 7: DATAPACK GENERATION PIPELINE $VIS_GRID_HOME/ rankprepout/spark/ ranking nfi al mergerank (GFFeedMapper, GFFeedReducer) $VIS_GRID_HOME/ gfrankout $VIS_GRID_HOME/ gfrankout/1/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfrankout/2/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfrankout/3/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfrankout/4/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfrankout/5/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfrankout10/geo/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/1/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/2/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/3/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/4/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/5/yalinda/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/10/geo/ phyfacet/ranking/spark $VIS_GRID_HOME/ gfstorageall/1/yalinda/ phyfacet/data $VIS_GRID_HOME/ gfstorageall/2/yalinda/ phyfacet/data $VIS_GRID_HOME/ gfstorageall/3/yalinda/ phyfacet/data $VIS_GRID_HOME/ gfstorageall/4/yalinda/ phyfacet/data $VIS_GRID_HOME/ gfstorageall/5/yalinda/ phyfacet/data $VIS_GRID_HOME/ gfstorageall/10/geo/ phyfacet/data $VIS_GRID_HOME/ gfstorageall/ logicalobj/data dumper1 (DumpFirstMapper, DumpFirstReducer) $VIS_GRID_HOME/ platdumpTmp dumper2 (DumpSecondMapper, DumpSecondReducer) $VIS_GRID_HOME/ platdump datapack (Mapper, YISFacetMergeReducer) $VIS_GRID_HOME/ datapack
  • 41. - 41 - High-Level Architecture View Entity graph Data preprocessing Feature extraction Model learning Feature sources Editorial judgements Datapack Ranking model Ranking and disambiguation Entity data Features
  • 42. - 42 - Spark Architecture Entity graph Data preprocessing Feature extraction Model learning Feature sources Editorial judgements Datapack Ranking model Ranking and disambiguation Entity data Features
  • 44. - 44 - Entity graph • 3.4 million entities, 160 million relations • Locations: Internet Locality, Wikipedia, Yahoo! Travel • Athletes, teams: Yahoo! Sports • People, characters, movies, TV shows, albums: Dbpedia • Example entities • Dbpedia Brad_Pitt Brad Pitt Movie_Actor • Dbpedia Brad_Pitt Brad Pitt Movie_Producer • Dbpedia Brad_Pitt Brad Pitt Person • Dbpedia Brad_Pitt Brad Pitt TV_Actor • Dbpedia Brad_Pitt_(boxer) Brad Pitt Person • Example relations • Dbpedia Dbpedia Brad_Pitt Angelina_Jolie Person_IsPartnerOf_Person • Dbpedia Dbpedia Brad_Pitt Angelina_Jolie MovieActor_CoCastsWith_MovieActor • Dbpedia Dbpedia Brad_Pitt Angelina_Jolie MovieProducer_ProducesMovieCastedBy_MovieActor
  • 45. - 45 - Entity graph challenges • Coverage of the query volume – New entities and entity types – Additional inference – International data – Aliases, e.g. jlo, big apple, thomas cruise mapother iv • Freshness – People query for a movie long before it’s released • Irrelevant entity and relation types – E.g. voice actors who co-acted in a movie, cities in a continent • Data quality – United States Senate career of Barack Obama is not a person – Andy Lau has never acted in Iron Man 3
  • 47. - 47 - Feature extraction from text • Text sources – Query terms – Query sessions – Flickr tags – Tweets • Common representation Input tweet: Brad Pitt married to Angelina Jolie in Las Vegas Output event: Brad Pitt + Angelina Jolie Brad Pitt + Las Vegas Angelina Jolie + Las Vegas
  • 48. - 48 - Features • Unary – Popularity features from text: probability, entropy, wiki id popularity … – Graph features: PageRank on the entity graph, wikipedia, web graph – Type features: entity type • Binary – Co-occurrence features from text: conditional probability, joint probability … – Graph features: common neighbors … – Type features: relation type
  • 49. - 49 - Feature extraction challenges • Efficiency of text tagging – Hadoop Map/Reduce • More features are not always better – Can lead to over-fitting without sufficient training data
  • 51. - 51 - Model Learning • Training data created by editors (five grades) 400 Brandi adriana lima Brad Pitt person Embarassing 1397 David H. andy garcia Brad Pitt person Mostly Related 3037 Jennifer benicio del toro Brad Pitt person Somewhat Related 4615 Sarah burn after reading Brad Pitt person Excellent 9853 Jennifer fight club movie Brad Pitt person Perfect • Join between the editorial data and the feature file • Trained a regression model using GBDT –Gradient Boosted Decision Trees • 10-fold cross validation optimizing NDCG and tuning •number of trees •number of nodes per tree
  • 52. - 52 - Feature importance RANK FEATURE IMPORTANCE 1 Relation type 100 2 PageRank (Related entity) 99.6075 3 Entropy – Flickr 94.7832 4 Probability – Flickr 82.6172 5 Probability – Query terms 78.9377 6 Shared connections 68.296 7 Cond. Probability – Flickr 68.0496 8 PageRank (Entity) 57.6078 9 KL divergence – Flickr 55.4604 10 KL divergence – Query terms 55.0662
  • 53. - 53 - Impact of training data Number of training instances (judged relations)
  • 54. - 54 - Performance by query-entity type •High overall performance but some types are more difficult •Locations – Editors downgrade popular entities such as businesses NDCG by type of the query entity
  • 55. - 55 - Model Learning challenges • Editorial preferences not necessarily coincide with usage – Users click a lot more on people than expected – Image bias? • Alternative: optimize for usage data – Clicks turned into labels or preferences – Size of the data is not a concern – Gains are computed from normalized CTR/COEC – See van Zwol et al. Ranking Entity Facets Based on User Click Feedback. ICSC 2010: 192-199. couple of hundred entities and their facets we find that linear combination of the conditional probabilities gives t performance on the collected judgements using wqt = 2, = 0.5, and wf t = 1. However, the editorial data was not stantial enough to learn a ranking with GBDT. Click-through Rate versus Click over Expected Click From the image search query logs, we collect the user click a that is related to the facets. This allows us to compute the ck-through rate (CTR) on a facet for a given entity that is ected in a user query and for which the facets were shown he user. Let clickse,f be the number of clicks on a facet ty f show in relation to entity e, and viewse,f the number times the facet f is shown to a user for a related entity e, n the probability of a click on a facet entity f for a given ty e can be modelled as ctre,f : ctre,f = clickse,f viewse,f (2) n Figure 3 the conditional click-through rate is shown for first ten positions. It shows the CTR per position for every ge view where one of the facets is clicked, aggregated over coece,f = cl PP p=1 vi Zhang and Jones [3] refer to expected clicks, based on the de expected clicks given the positio C. Gradient Boosted Decision Tr Stochastic gradient boosted dec the most widely used learning alg today. Gradient tree boosting co sion model, utilizing decision tr One advantage over other learn trees in general is that the feat are highly interpretable. GBDT different loss functions can be u research presented here we used our loss function. In related work, pairwise and ranking specific lo well at improving search relevanc shallow decision trees, trees in s on a randomly selected subset of prone to over-fitting [14]. For the shown in the search engine he ground truth for creating set used by the gradient onal Probabilities of the facets search expe- unction rank(e, f) that is onal probabilities extracted ⇥Pqs(f|e)+wf t ⇥Pf t (f, e) (1) e) are the conditional prob- he weights for the different qt), query session (qs) and l judgements collected for their facets we find that ditional probabilities gives udgements using wqt = 2, the editorial data was not ng with GBDT. k over Expected Click gs, we collect the user click all entities. Observe that the CTR declines when the position at which a facet is shown increases. We introduce a second click model, based on the notion of clicks over expected clicks (COEC). To allows us to deal with the so called position bias – where facets appearing in lower positions are less likely to be clicked even if they are relevant [2]. This phenomenon isoften observed in Web search and we adopt the COEC model proposed by Chapelle and Zhang [11]. In that model, we estimate ctrp as the aggregated ctr – over all queries and sessions – in position p for all positions P. Let then clickse,f be the number of clicks on a facet entity f show in relation to entity e, and viewse,f p the number of times the facet f is shown to a user for a related entity e at position p. The probability of a click over expected click on a facet entity f for a given entity e can then be modelled as coece,f : coece,f = clickse,f PP p=1 viewse,f p ⇥ ctrp (3) Zhang and Jones [3] refer to this method as clicks over expected clicks, based on the denominator that includes the expected clicks given the positions that the url appeared in. C. Gradient Boosted Decision Trees Stochastic gradient boosted decision trees (GBDT) is one of
  • 57. - 57 - Ranking and Disambiguation • We apply the ranking function offline to the data • Disambiguation – How many times a given wiki id was retrieved for queries containing the entity name? Brad Pitt Brad_Pitt 21158 Brad Pitt Brad_Pitt_(boxer) 247 XXX XXX_(movie) 1775 XXX XXX_(Asia_album) 89 XXX XXX_(ZZ_Top_album) 87 XXX XXX_(Danny_Brown_album) 67 – PageRank for disambiguating locations (wiki ids are not available) • Expansion to query patterns – Entity name + context, e.g. brad pitt actor
  • 58. - 58 - Ranking and Disambiguation challenges • Disambiguation cases that are too close to call – Fargo Fargo_(film) 3969 – Fargo Fargo,_North_Dakota 4578 • Disambiguation across Wikipedia and other sources
  • 59. - 59 - Evaluation #2: Side-by-side testing • Comparing two systems – A/B comparison, e.g. current system under development and production system – Scale: A is better, B is better • Separate tests for relevance and image quality – Image quality can significantly influence user perceptions – Images can violate safe search rules • Classification of errors – Results: missing important results/contains irrelevant results, too few results, entities are not fresh, more/less diverse, should not have triggered – Images: bad photo choice, blurry, group shots, nude/racy etc. • Notes – Borderline, set one entities relate to the movie Psy but the query is most likely about Gangnam style – Blondie and Mickey Gilley are 70’s performers and do not belong on a list of 60’s musicians. – There is absolutely no relation between Finland and California.
  • 60. - 60 - Evaluation #3: Bucket testing • Also called online evaluation – Comparing against baseline version of the system – Baseline does not change during the test • Small % of search traffic redirected to test system, another small % to the baseline system • Data collection over at least a week, looking for stat. significant differences that are also stable over time • Metrics in web search – Coverage and Click-through Rate (CTR) – Searches per browser-cookie (SPBC) – Other key metrics should not impacted negatively, e.g. Abandonment and retry rate, Daily Active Users (DAU), Revenue Per Search (RPS), etc.
  • 61. - 61 - Coverage before and after the new system 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 Days Coverage Coverage before Spark Trend before Spark Coverage after Spark Trend after Spark Spark is deployed in production Before release: Flat, lower After release: Flat, higher
  • 62. - 62 - Click-through rate (CTR) before and after the new system Before release: Gradually degrading performance due to lack of fresh data After release: Learning effect: users are starting to use the tool again 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 Days CTR CTR before Spark Trend before Spark CTR after Spark Trend after Spark Spark is deployed in production
  • 63. - 63 - Summary • Spark – System for related entity recommendations • Knowledge base • Extraction of signals from query logs and other user-generated content • Machine learned ranking • Evaluation • Other applications – Recommendations on topic-entity pages
  • 64. - 64 - Future work • New query types – Queries with multiple entities • adele skyfall – Question-answering on keyword queries • brad pitt movies • brad pitt movies 2010 • Extending coverage – Spark now live in CA, UK, AU, NZ, TW, HK, ES • Even fresher data – Stream processing of query log data • Data quality improvements • Online ranking with post-retrieval features
  • 65. - 65 - The End • Many thanks to – Barla Cambazoglu and Roi Blanco (Barcelona) – Nicolas Torzec (US) – Libby Lin (Product Manager, US) – Search engineering (Taiwan) • Contact – pmika@yahoo-inc.com – @pmika • Internships available for PhD students

Notas do Editor

  1. In fact, some of these searches are so hard that the users don’t even try them anymore
  2. In fact, some of these searches are so hard that the users don’t even try them anymore
  3. With ads, the situation is even worse due to the sparsity problem. Note how poor the ads are…
  4. Challlenges “ so someone could search for ‘volcanic eruptions in the 18th century’ or ‘Lady Gaga concerts in a warm outdoor location.’” For now, there’s no way for Google or Microsoft to reliably parse those sorts of queries against entities’ properties. And Microsoft’s Weitz says that “people have been beaten down in search for so long, they’d never ask” those sorts of questions in the first place. Complex relational queries are the sorts of things that Bing and Google tend to push off to their more vertical search areas, like Bing’s shopping and travel search and Google’s Shopping and Flights.
  5. This is how a human sees the world.
  6. This is how a machine sees the world… Machines are not ‘intelligent’ and can not ‘read’… they just see a string of symbols and try to match the users input to that stream.
  7. Semantic search can be seen as a retrieval paradigm Centered on the use of semantics Incorporates the semantics entailed by the query and (or) the resources into the matching process, it essentially performs semantic search.
  8. Conceptual models, e.g. ER models, capture relations between data elements , which essentially capture relationships between syntactic elements and their interpretations
  9. - In recent years we have witnessed tremendous interest and substantial economic exploitation of search technologies, both at web and enterprise scale. In this regard, technologies for Information Retrieval (IR) can be distinguished from solutions in the field of Database (DB) and Knowledge-based Expert Systems (KB). Whereas IR applications support the retrieval of documents (document retrieval), DB and KB systems deliver more precise answers (data retrieval). The technical differences between these two paradigms can be broken down into three main dimensions, i.e. the representation of the user need ( query model ), the underlying resources ( data model ) and the matching technique . The representation of user need and resource content in current IR systems is still almost exclusively achieved by the lightweight syntax-centric models such as the predominant keyword paradigm (i.e. keyword queries matched against bag-of-words document representation). While these systems have shown to work well for topical search, i.e. retrieve documents based on a topic, they usually fail to address more complex information needs. Using more expressive models for the representation of the user need, and leveraging the structure and semantics inherent in the data, DB and KB systems allow for complex queries, and for the retrieval of concrete answers that precisely match them.