1. Semantic Search: from document retrieval to Virtual Assistants
Presented by Peter Mika, Director of Research, Yahoo Labs | March 20, 2015
2. The Semantic Web (2001-)
Part of Tim Berners-Lee’s
original proposal for the Web
Beginning of a research community
› Formal ontology
› Logical reasoning
› Agents, web services
Rough start in deployment
› Misplaced expectations
› Lack of adoption
3. The Semantic Web, May 2001
“At the doctor's office, Lucy instructed her
Semantic Web agent through her handheld Web
browser. The agent promptly retrieved
information about Mom's prescribed treatment
from the doctor's agent, looked up several lists
of providers, and checked for the ones in-plan
for Mom's insurance within a 20-mile radius of
her home and with a rating of excellent or very
good on trusted rating services. It then began
trying to find a match between available
appointment times (supplied by the agents of
individual providers through their Web sites) and
Pete's and Lucy's busy schedules.”
(The emphasized keywords indicate terms
whose semantics, or meaning, were defined for
the agent through the Semantic Web.)
Misplaced expectations?
4. Lack of adoption
Standardization ahead of adoption
› URI, RDF, RDF/XML, RDFa, JSON-LD,
OWL, RIF, SPARQL, OWL-S, POWDER …
Chicken and egg problem
› No users/use cases, hence no data
› No data, because no users/use cases
By 2007, some modest progress
› Metadata in HTML: microformats
› Linked Data: simplifying the stack
5. Web search by 2007
Large classes of queries are solved to perfection
Improvements in web search are harder and harder to come by
› Relevance models, hyperlink structure and interaction data
› Combination of features using machine learning
› Heavy investment in computational power
• real-time indexing, instant search, datacenters and edge services
6. Poorly solved information needs remain
Language issues
› Multiple interpretations
• jaguar
• paris hilton
› Secondary meaning
• george bush (and I mean the beer brewer in Arizona)
› Subjectivity
• reliable digital camera
• paris hilton sexy
› Imprecise or overly precise searches
• jim hendler
Complex needs
› Missing information
• brad pitt zombie
• florida man with 115 guns
• 35 year old computer scientist living in barcelona
› Category queries
• countries in africa
• barcelona nightlife
› Transactional or computational queries
• 120 dollars in euros
• digital camera under 300 dollars
• world temperature in 2020
Many of these queries would not be asked by users, who learned over time what search technology can and cannot do.
7. Web search by 2007
Are there even any true keyword queries?
› Lyrics, quotes and bugs… anything else?
Remaining challenges are not computational, but in modeling user cognition
› Need a deeper understanding of the query, the content and/or the world at large
8. Microsearch internal prototype (2007)
Personal and private homepage of the same person (clear from the snippet, but it could also be automatically de-duplicated)
Conferences he plans to attend and his vacations from homepage, plus bio events from LinkedIn
Geolocation
9. Enhanced Results
Computing abstracts is hard
› Summarization of HTML
• Template detection
• Selecting relevant snippets
• Composing readable text
› Efficiency constraints
Structured data to replace or complement text summary
› Key/value pairs
› Deep links
› Image or Video
10. Yahoo SearchMonkey (2008)
1. Extract structured data
› Semantic Web markup
• Example:
<span property="vcard:city">Santa Clara</span>
<span property="vcard:region">CA</span>
› Information Extraction
2. Presentation
› Fixed presentation templates
• One template per object type
› Applications
• Third-party modules to display data (SearchMonkey)
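The extraction step can be sketched in a few lines of Python. This is only an illustration of reading `property` attributes from markup like the example on this slide, not SearchMonkey's actual extractor; the `PropertyExtractor` class is a hypothetical name, and the sketch assumes flat, non-nested markup.

```python
from html.parser import HTMLParser


class PropertyExtractor(HTMLParser):
    """Collect (property, text) pairs from elements with a `property` attribute.

    Simplification: assumes property-bearing elements are not nested.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []   # extracted (property, value) pairs, in document order
        self._open = []   # stack of [tag, property, text chunks]

    def handle_starttag(self, tag, attrs):
        prop = dict(attrs).get("property")
        if prop is not None:
            self._open.append([tag, prop, []])

    def handle_data(self, data):
        # Text outside any property-bearing element is ignored.
        if self._open:
            self._open[-1][2].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            _, prop, chunks = self._open.pop()
            self.pairs.append((prop, "".join(chunks).strip()))


markup = (
    '<span property="vcard:city">Santa Clara</span> '
    '<span property="vcard:region">CA</span>'
)
extractor = PropertyExtractor()
extractor.feed(markup)
print(dict(extractor.pairs))
# {'vcard:city': 'Santa Clara', 'vcard:region': 'CA'}
```

The resulting key/value pairs are the kind of input a fixed presentation template per object type could render.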
11. Effectiveness of enhanced results
Explicit user feedback
› Side-by-side editorial evaluation (A/B testing)
• Editors are shown a traditional search result and enhanced result for the same page
• Users prefer enhanced results in 84% of the cases and traditional results in 3% (N=384)
Implicit user feedback
› Click-through rate analysis
• Long dwell time limit of 100s (Ciemiewicz et al. 2010)
• 15% increase in ‘good’ clicks
› User interaction model
• Enhanced results lead users to relevant documents (IV) even though they are less likely to be clicked than textual results (III)
• Enhanced results effectively reduce bad clicks!
See
› Kevin Haas, Peter Mika, Paul Tarjan, Roi Blanco: Enhanced results for web search. SIGIR 2011: 725-734
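A minimal sketch of the implicit-feedback measurement, assuming a click counts as 'good' when its dwell time reaches the 100 s limit mentioned above. The dwell times below are made-up illustration data, not results from the paper, and the function names are hypothetical.

```python
LONG_DWELL_SECONDS = 100  # dwell-time limit quoted on the slide


def good_click_rate(dwell_times):
    """Fraction of clicks whose dwell time reaches the long-dwell limit."""
    if not dwell_times:
        return 0.0
    good = sum(1 for t in dwell_times if t >= LONG_DWELL_SECONDS)
    return good / len(dwell_times)


def relative_increase(baseline, treatment):
    """Relative change of `treatment` over `baseline` (e.g. 0.15 means +15%)."""
    return (treatment - baseline) / baseline


# Hypothetical dwell times in seconds, one entry per click.
traditional_clicks = [5, 12, 140, 30, 220, 8, 101, 45, 60, 15]
enhanced_clicks = [150, 9, 300, 120, 40, 110, 20, 35, 70, 10]

base = good_click_rate(traditional_clicks)
enhanced = good_click_rate(enhanced_clicks)
print(base, enhanced, round(relative_increase(base, enhanced), 3))
```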
12. Adoption among consumers of web content
Google announces Rich Snippets - June, 2009
› Faceted search for recipes - Feb, 2011
Bing tiles – Feb, 2011
Facebook’s Like button and the Open Graph Protocol (2010)
› Shows up in profiles and news feed
› Site owners can later reach users who have liked an object
13. schema.org
Agreement on a shared set of schemas for common types of web content
› Bing, Google, and Yahoo! as initial founders (June, 2011)
• Yandex joins schema.org in Nov, 2011
› Similar in intent to sitemaps.org
• Use a single format to communicate the same information to all three search engines
schema.org covers areas of interest to all search engines
› Business listings (local), creative works (video), recipes, reviews and more
› Microdata, RDFa, JSON-LD syntax
Collaborative effort
› Growing number of 3rd party contributions
› schema.org discussions at public-vocabs@w3.org
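As a rough illustration of the JSON-LD syntax, the sketch below builds schema.org markup for a recipe and wraps it in the script element a publisher would embed in a page. The recipe content and the `to_jsonld_script` helper are hypothetical; the type and property names (`Recipe`, `recipeIngredient`, `aggregateRating`) are taken from the schema.org vocabulary.

```python
import json

# Hypothetical recipe described with schema.org types and properties.
recipe = {
    "@context": "https://schema.org",
    "@type": "Recipe",
    "name": "Example pancakes",
    "recipeIngredient": ["flour", "milk", "eggs"],
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": 4.5,
        "reviewCount": 12,
    },
}


def to_jsonld_script(data):
    """Serialize a dict as a JSON-LD <script> element for embedding in HTML."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


print(to_jsonld_script(recipe))
```

Because the markup is plain JSON, the same block can be consumed by every search engine that supports the vocabulary, which is the point of the shared schemas.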
14. Adoption among publishers of content
R.V. Guha: Light at the end of the tunnel (ISWC 2013 keynote)
› Over 15% of all pages now have schema.org markup
› Over 5 million sites, over 25 billion entity references
› In other words
• Same order of magnitude as the web
See also
› P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012
• Based on Bing US corpus
• 31% of webpages, 5% of domains contain some metadata
› WebDataCommons
• Based on CommonCrawl Nov 2013
• 26% of webpages, 14% of domains contain some metadata
16. Yahoo’s Knowledge Graph
[Figure: fragment of the knowledge graph. Entities: Chicago Cubs, Chicago, Barack Obama, Carlos Zambrano, Brad Pitt, Angelina Jolie, Steven Soderbergh, George Clooney, Ocean’s Twelve, E/R, Fight Club, Dust Brothers, and a “10% off tickets” offer. Edge labels: for, plays for, plays in, lives in, partner, directs, casts in, takes place in, music by.]
Nicolas Torzec: Making knowledge reusable at Yahoo!:
a Look at the Yahoo! Knowledge Base (SemTech 2013)
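A knowledge graph of this kind can be thought of as a set of subject/predicate/object triples. The sketch below is a toy in-memory version; the pairing of entities and edge labels is an assumption, since the layout of the original figure is not recoverable, and it says nothing about how the Yahoo Knowledge Base is actually stored.

```python
# Assumed triples, using the entity names and edge labels from the figure.
TRIPLES = [
    ("Carlos Zambrano", "plays for", "Chicago Cubs"),
    ("Chicago Cubs", "plays in", "Chicago"),
    ("Barack Obama", "lives in", "Chicago"),
    ("Steven Soderbergh", "directs", "Ocean's Twelve"),
    ("George Clooney", "casts in", "Ocean's Twelve"),
    ("Brad Pitt", "casts in", "Ocean's Twelve"),
    ("Brad Pitt", "casts in", "Fight Club"),
    ("Brad Pitt", "partner", "Angelina Jolie"),
    ("Fight Club", "music by", "Dust Brothers"),
]


def objects(subject, predicate, triples=TRIPLES):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(predicate, obj, triples=TRIPLES):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def co_stars(actor):
    """Other entities cast in any work the given actor is cast in."""
    result = set()
    for work in objects(actor, "casts in"):
        result |= subjects("casts in", work)
    return result - {actor}


print(co_stars("Brad Pitt"))  # {'George Clooney'}
```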
17. Information extraction and reconciliation
Information extraction
› Automated information extraction
• e.g. wrapper induction
› Metadata from HTML pages
• Focused crawler
› Public datasets (e.g. DBpedia)
› Proprietary data
Data fusion
› Manual mapping from the source schemas to the ontology
› Supervised entity reconciliation
• Kedar Bellare, Carlo Curino, Ashwin Machanavajjhala, Peter Mika, Mandar Rahurkar, Aamod Sane: WOO: A Scalable and Multi-tenant Platform for Continuous Knowledge Base Synthesis. PVLDB 2013
• Michael J. Welch, Aamod Sane, Chris Drome: Fast and accurate incremental entity resolution relative to an entity knowledge base. CIKM 2012
Ontology management
› Editorially maintained OWL ontology with 300+ classes
› Covering the domains of interest of Yahoo
Curation and quality assessment
› Editors and user feedback still play a large role
18. Semantic Search
Active research field at the intersection of IR, NLP, DB and SemWeb
› ESAIR at SIGIR, SemSearch at ESWC/WWW, EOS and JIWES at SIGIR, Semantic Search at VLDB
Exploiting semantic understanding in the retrieval process
› User intent and resources are represented using semantic models
• Not just symbolic representations
› Semantic models are exploited in the matching and ranking of resources
Tasks
› information extraction
› information reconciliation/tracking
› query understanding
› retrieving/ranking entities/attributes/relations
› result presentation
19. Semantic Search – a process view
Query Construction
• Keywords
• Forms
• NL
• Formal language
Query Processing
• IR-style matching & ranking
• DB-style precise matching
• KB-style matching & inferences
Result Presentation
• Query visualization
• Document and data presentation
• Summarization
Query Refinement
• Implicit feedback
• Explicit feedback
• Incentives
Document Representation
Knowledge Representation
Semantic Models
Resources
Documents
20. Semantic understanding
Documents
› Text in general
• Exploiting natural language structure and semantic coherence
› Specific to the Web
• Exploiting structure of web pages, e.g. annotation of web tables
Queries
› Short text and no structure… nothing to do?
21. Semantic understanding of queries
Entities play an important role
› [Pound et al, WWW 2010], [Lin et al WWW 2012]
› ~70% of queries contain a named entity (entity mention queries)
• brad pitt height
› ~50% of queries have an entity focus (entity seeking queries)
• brad pitt attacked by fans
› ~10% of queries are looking for a class of entities
• brad pitt movies
Entity mention query = <entity> {+ <intent>}
› Intent is typically an additional word or phrase to
• Disambiguate, most often by type e.g. brad pitt actor
• Specify action or aspect e.g. brad pitt net worth, toy story trailer
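The pattern `<entity> {+ <intent>}` can be illustrated with a toy query parser. This is a sketch under strong assumptions: the entity dictionary is a hypothetical stand-in for a knowledge base, and the entity is assumed to appear at the start of the query, whereas a real system would use entity linking.

```python
# Hypothetical entity dictionary standing in for a knowledge base lookup.
KNOWN_ENTITIES = {"brad pitt", "toy story"}


def split_entity_intent(query, entities=KNOWN_ENTITIES):
    """Split a query into (entity, intent) by longest-prefix dictionary match.

    Returns (None, normalized query) when no known entity starts the query,
    and (entity, None) when the query is the entity alone.
    """
    tokens = query.lower().split()
    for end in range(len(tokens), 0, -1):  # try the longest prefix first
        candidate = " ".join(tokens[:end])
        if candidate in entities:
            intent = " ".join(tokens[end:]) or None
            return candidate, intent
    return None, " ".join(tokens)


for q in ["brad pitt net worth", "toy story trailer", "brad pitt actor", "brad pitt"]:
    print(q, "->", split_entity_intent(q))
```

In this sketch "actor" in `brad pitt actor` ends up in the intent slot, matching the slide's reading of intent words as disambiguators by type.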
23. Annotation over sessions
Example session: oakland as bradd pitt movie moneyball trailer movies.yahoo.com oakland as wikipedia.org
Annotated entity types: Sports team, Movie, Actor
24. Common tasks in Semantic Search
[Diagram: tasks shown are entity retrieval, entity search, list search, list completion, related entity finding, question-answering and document retrieval. Evaluation campaigns shown: SemSearch 2010/11, SemSearch 2011, TREC ELC task, TREC REF-LOD task, QALD 2012/13/14 (question-answering). Document retrieval example: Dalton et al SIGIR 2014.]
25. Application: entity displays in web search
Entity-seeking queries make up 40-50% of the query volume
› Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010: 771-780
› Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, Ariel Fuxman: Active objects: actions for entity-centric search. WWW 2012: 589-598
Show a summary of the most likely information-needs
› Including related entities for navigation
› Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. ISWC 2013
28. Mobile search on the rise
Information access on-the-go requires hands-free operation
› Driving, walking, gym, etc.
• Americans spend 540 hours a year in their cars [1] vs. 348 hours browsing the Web [2]
~50% of queries are coming from mobile devices (and growing)
› Changing habits, e.g. iPad usage peaks before bedtime
› Limitations in input/output
[1] http://answers.google.com/answers/threadview?id=392456
[2] http://articles.latimes.com/2012/jun/22/business/la-fi-tn-top-us-brands-news-web-sites-20120622
29. Mobile search challenges and opportunities
Interaction
› Question-answering
› Support for interactive retrieval
› Spoken-language access
› Task completion
Contextualization
› Personalization
› Geo
› Context (work/home/travel)
• Try getaviate.com
30. Interactive, conversational voice search
Parlance EU project
› Complex dialogs within a domain
• Requires complete semantic understanding
Complete system (mixed license)
› Automated Speech Recognition (ASR)
› Spoken Language Understanding (SLU)
› Interaction Management
› Knowledge Base
› Natural Language Generation (NLG)
› Text-to-Speech (TTS)
Video
31. Task completion
We would like to help our users in task completion
› But we have trained our users to talk in nouns
• Retrieval performance decreases by adding verbs to queries
› We need to understand what the available actions are
Modeling actions
› Understand what actions can be taken on a page
› Help users in mapping their query to potential actions
› Applications in web search, email etc.
Schema.org v1.2, including Actions, published April 16, 2014
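To make "understand what actions can be taken on a page" concrete, here is a small sketch that reads schema.org Actions markup and lists the actions a page advertises. The page markup and the example.com URL are hypothetical, and `available_actions` is an illustrative helper, not part of any library; `potentialAction`, `WatchAction`, `EntryPoint` and `urlTemplate` are schema.org terms.

```python
import json

# Hypothetical JSON-LD found on a movie page.
PAGE_MARKUP = json.dumps({
    "@context": "https://schema.org",
    "@type": "Movie",
    "name": "Moneyball",
    "potentialAction": {
        "@type": "WatchAction",
        "target": {
            "@type": "EntryPoint",
            "urlTemplate": "https://example.com/watch/moneyball",
        },
    },
})


def available_actions(jsonld_text):
    """Return (action type, target URL) pairs advertised by a JSON-LD object."""
    data = json.loads(jsonld_text)
    actions = data.get("potentialAction", [])
    if isinstance(actions, dict):  # a single action may appear without a list
        actions = [actions]
    found = []
    for action in actions:
        target = action.get("target")
        if isinstance(target, dict):  # EntryPoint object rather than a bare URL
            target = target.get("urlTemplate")
        found.append((action.get("@type"), target))
    return found


print(available_actions(PAGE_MARKUP))
# [('WatchAction', 'https://example.com/watch/moneyball')]
```

A search engine or email client could use such pairs to map a query like "moneyball trailer" to an action the page supports.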
33. Q&A
Many thanks to members of the Semantic Search team
at Yahoo Labs Barcelona and to Yahoos around the world
Contact me
› pmika@yahoo-inc.com
› @pmika
› http://www.slideshare.net/pmika/