Semantic Search overview at SSSW 2012

Semantic Search
Peter Mika
Senior Research Scientist
Yahoo! Research

With contributions from Thanh Tran (KIT)

Yahoo! serves over 700 million users in 25 countries

-2-

Yahoo! Research: visit us at research.yahoo.com

-3-

Yahoo! Research Barcelona

• Established January, 2006
• Led by Ricardo Baeza-Yates
• Research areas
– Web Mining
• content, structure, usage
– Social Media
– Distributed Systems
– Semantic Search

-4-

Search is really fast, without necessarily being intelligent

-5-

Why Semantic Search? Part I

• Improvements in IR are harder and harder to come by
– Machine learning using hundreds of features
• Text-based features for matching
• Graph-based features provide authority
– Heavy investment in computational power, e.g. real-time
indexing and instant search
• Remaining challenges are not computational, but in
modeling user cognition
– Need a deeper understanding of the query, the content and/or
the world at large
– Could Watson explain why the answer is Toronto?

-6-

Poorly solved information needs
• Multiple interpretations
– paris hilton
• Long tail queries Many of these queries
would not be asked by
– george bush (and I mean the beer brewer in Arizona)
users, who learned over
• Multimedia search time what search
– paris hilton sexy technology can and can
not do.
• Imprecise or overly precise searches
– jim hendler
– pictures of strong adventures people
• Searches for descriptions
– countries in africa
– 32 year old computer scientist living in barcelona
– reliable digital camera under 300 dollars

-7-

Example: multiple interpretations

-8-

Why Semantic Search? Part II

• The Semantic Web is now a reality
– Large amounts of RDF data
– Heterogeneous schemas, quality
– Users who are not skilled in writing
complex queries (e.g. SPARQL)
and may not be experts in the
domain

• Searching data instead or in
addition to searching documents
– Direct answers
– Novel search tasks

-9-

Example: direct answers in search

Points of Faceted
interest in Information
search for Information box
Vienna, from the Shopping with content from
Austria Knowledgeresults and links to Yahoo!
Graph Travel
Since Aug,
2010, ‘regular’
search results
are ‘Powered
by Bing’

- 10 -

Novel search tasks

• Aggregation of search results
– e.g. price comparison across websites
• Analysis and prediction
– e.g. world temperature by 2020
• Semantic profiling
– Ontology-based modeling of user interests
• Semantic log analysis
– Linking query and navigation logs to ontologies
• Support for complex tasks (search apps)
– e.g. booking a vacation using a combination of services

- 11 -

Contextual (pervasive, ambient) search
Yahoo! Connected
TV:
Widget engine
embedded into the
TV

Yahoo! IntoNow:
recognize audio and
show related content

- 12 -

Interactive search and task completion

- 13 -

Why Semantic Search? Part III

• There is a use case
– Consumers want to understand content
– Publishers want consumers to understand their content
• Semantic Web standards seem to be a good fit

http://en.wikipedia.org/wiki/Underpants_Gnomes
- 14 -

Example: rNews

• RDFa vocabulary for
news articles
– Easier to implement than
NewsML
– Easier to consume for
news search and other
readers, aggregators
• Under development at
the IPTC
– March: v0.1 approved
– Final version by Sept

- 15 -

Example: Facebook’s Like and the Open Graph Protocol

• The ‘Like’ button provides publishers with a way to promote
their content on Facebook and build communities
– Shows up in profiles and news feed
– Site owners can later reach users who have liked an object
– Facebook Graph API allows 3rd party developers to access the
data
• Open Graph Protocol is an RDFa-based format that allows
to describe the object that the user ‘Likes’

- 16 -

Example: Facebook’s Open Graph Protocol

• RDF vocabulary to be used in conjunction with RDFa
– Simplify the work of developers by restricting the freedom in RDFa
• Activities, Businesses, Groups, Organizations, People, Places,
Products and Entertainment
• Only HTML <head> accepted
• http://opengraphprotocol.org/
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url"
content="http://www.imdb.com/title/tt0117500/" />
<meta property="og:image" content="http://ia.media-
imdb.com/images/rock.jpg" /> …
</head> ... - 17 -

Example: schema.org

• Agreement on a shared set of schemas for common types of
web content
– Bing, Google, and Yahoo! as initial supporters
– Similar in intent to sitemaps.org (2006)
• Use a single format to communicate the same information to all
three search engines
• Support for microdata
• schema.org covers areas of interest to all search engines
– Business listings (local), creative works (video), recipes,
reviews
– User defined extensions
• Each search engine continues to develop its products

- 18 -

Documentation and OWL ontology

- 19 -

Current state of metadata on the Web

• 31% of webpages, 5% of domains contain some
metadata
– Analysis of the Bing Crawl (US crawl, January, 2012)
– RDFa is most common format
• By URL: 25% RDFa, 7% microdata, 9% microformat
• By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
– Adoption is stronger among large publishers
• Especially for RDFa and microdata
• See also
– P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus,
LDOW 2012
– H.Mühleisen, C.Bizer.Web
Data Commons - Extracting Structured Data from Two Large Web Corpo
, LDOW 2012

- 20 -

Exponential growth in RDFa data

Another five-fold increase
between October 2010 and
January, 2012

Five-fold increase
between March, 2009 and
October, 2010

Percentage of URLs with embedded metadata in various formats
- 21 -

Semantic Search: a definition

• Semantic search is a retrieval paradigm that
– Makes use of the structure of the data or explicit schemas
to understand user intent and the meaning of content
– Exploits this understanding at some part of the search
process
• Web search vs. vertical/enterprise/desktop search
• Related fields:
– XML retrieval
– Keyword search in databases
– Natural Language Retrieval

- 23 -

Semantics at every step of the IR process

bla bla bla? The IR engine The Web

Query interpretation

q=“bla” * 3

Crawling and indexing bla

bla bla
Indexing
θ(q,d) “bla”
Ranking

bla bla
bla
bla bla bla Result presentation
- 24 -

Data on the Web

• Most web pages on the Web are generated from structured
data
– Data is stored in relational databases (typically)
– Queried through web forms
– Presented as tables or simply as unstructured text
• The structure and semantics (meaning) of the data is not
directly accessible to search engines
• Two solutions
– Extraction using Information Extraction (IE) techniques
(implicit metadata)
• Supervised vs. unsupervised methods
– Relying on publishers to expose structured data using standard
Semantic Web formats (explicit metadata)
• Particularly interesting for long tail content
- 26 -

Information Extraction methods

• Natural Language Processing
• Extraction of triples
– Suchanek et al. YAGO: A Core of Semantic Knowledge
Unifying WordNet and Wikipedia, WWW, 2007.
– Wu and Weld. Autonomously Semantifying Wikipedia, CIKM
2007.
• Filling web forms automatically (form-filling)
– Madhavan et al. Google's Deep-Web Crawl. VLDB 2008
• Extraction from HTML tables
– Cafarella et al. WebTables: Exploring the Power of Tables on
the Web. VLDB 2008
• Wrapper induction
– Kushmerick et al. Wrapper Induction for Information
ExtractionText extraction. IJCAI 2007
- 27 -

Semantic Web

• Sharing data across the Web
– Publish information in standard formats (RDF, RDFa)
– Share the meaning using powerful, logic-based languages
(OWL, RIF)
– Query using standard languages and protocols (HTTP, SPARQL)
• Two main forms of publishing
– Linked Data
• Data published as RDF documents linked to other RDF documents
and/or using SPARQL end-points
• Community effort to re-publish large public datasets (e.g. Dbpedia,
open government data)
– RDFa
• Data embedded inside HTML pages
• Recommended for site owners by Yahoo, Google, Facebook
- 28 -

Crawling the Semantic Web

• Linked Data
– Similar to HTML crawling, but the the crawler needs to parse
RDF/XML (and others) to extract URIs to be crawled
– Semantic Sitemap/VOID descriptions
• RDFa
– Same as HTML crawling, but data is extracted after crawling
– Mika et al. Investigating the Semantic Gap through Query Log
Analysis, ISWC 2010.
• SPARQL endpoints
– Endpoints are not linked, need to be discovered by other
means
– Semantic Sitemap/VOID descriptions

- 29 -

Data fusion
• Ontology matching
– Widely studied in Semantic Web research, see e.g. list of
publications at ontologymatching.org
• Unfortunately, not much of it is applicable in a Web context due to the
quality of ontologies
• Entity resolution
– Logic-based approaches in the Semantic Web
– Studied as record linkage in the database literature
• Machine learning based approaches, focusing on attributes
– Graph-based approaches, see e.g. the work of Lisa Getoor are
applicable to RDF data
• Improvements over only attribute based matching
• Blending
– Merging objects that represent the same real world entity and
reconciling information from multiple sources

- 30 -

Data quality assessment and curation
• Heterogeneity, quality of data is an even larger issue
– Quality ranges from well-curated data sets (e.g. Freebase) to
microformats
• In the worst of cases, the data becomes a graph of words
– Short amounts of text: prone to mistakes in data entry or extraction
• Example: mistake in a phone number or state code
• Quality assessment and data curation
– Quality varies from data created by experts to user-generated
content
– Automated data validation
• Against known-good data or using triangulation
• Validation against the ontology or using probabilistic models
– Data validation by trained professionals or crowdsourcing
• Sampling data for evaluation
– Curation based on user feedback

- 31 -

Indexing

• Search requires matching and ranking
– Matching selects a subset of the elements to be scored
• The goal of indexing is to speed up matching
– Retrieval needs to be performed in milliseconds
– Without an index, retrieval would require streaming through the
collection
• The type of index depends on the query model to support
– DB-style indexing
– IR-style indexing

- 32 -

IR-style indexing

• Index data as text
– Create virtual documents from data
– One virtual document per subgraph, resource or triple
• typically: resource
• Key differences to Text Retrieval
– RDF data is structured
– Minimally, queries on property values are required

- 33 -

Horizontal index structure

• Two fields (indices): one for terms, one for properties
• For each term, store the property on the same position in the
property index
– Positions are required even without phrase queries
• Query engine needs to support the alignment operator
• ✓ Dictionary is number of unique terms + number of
properties
• Occurrences is number of tokens * 2

- 34 -

Vertical index structure

• One field (index) per property
• Positions are not required
– But useful for phrase queries
• Query engine needs to support fields
• Dictionary is number of unique terms
• Occurrences is number of tokens
• ✗ Number of fields is a problem for merging, query performance

- 35 -

Distributed indexing

• MapReduce is ideal for building inverted indices
– Map creates (term, {doc1}) pairs
– Reduce collects all docs for the same term: (term, {doc1,
doc2…}
– Sub-indices are merged separately
• Term-partitioned indices
• Peter Mika. Distributed Indexing for Semantic Search,
SemSearch 2010.

- 36 -

What is search?

• The search problem
– A data collection consisting of a set of items (units of retrieval)
– Information needs expressed as queries
– Ambiguity in the interpretation of the data and/or the queries
• Search is the task of efficiently finding items that are relevant
to the information need
– Query processing mainly focuses on efficiency of matching
whereas ranking deals with degree of matching (relevance)!

- 38 -

Types of collections and query paradigms

• Types of collections
– Structured data with well defined schemas
– Semi-structured data with incomplete or no schemas
– Data that largely comprise text
– Hybrid / embedded data
• Units of retrieval
– Subgraphs, triples, resources etc.
• Query paradigms
– Natural language texts and keywords
– Form-based inputs
– Formal structured queries

- 39 -

Types of data models (1)

• Textual
– Bag-of-words
– Represent documents, text in structured data,…, real-world
objects (captured as structured data)
– Lacks “structure”
• Text structure, e.g. linguistic structure, outlines, hyperlinks etc.
• Structure in structured data representation

term (statistics)
In combination with
combination
Cloud Computing
Cloud
technologies, promising
Computing
solutions for the
Technologies
management of `big
solutions
data' have emerged.
management
Existing industry
`big data'
solutions are able to
industry
support complex
solutions
queries and analytics
support
tasks with terabytes of
complex
data. For example,
……
using a Greenplum.

- 40 -


• Graph structure
– Relationships in the data
• Hyperlinks
• Typed relationships
– Ontology
creator Picture
Person

Bob

- 41 -


• Hybrid
– RDF data embedded in text (RDFa)

- 42 -

Formalisms for querying semantic data (1)

Example information need
“Information about a friend of Alice, who shared
an apartment with her in Berlin and knows
someone working at KIT.”

- 43 -


• Unstructured
– NL
– Keywords
shared apartment Berlin Alice

- 44 -


• Fully-structured
– SPARQL: BGP, filter, optional, union, select, construct, ask,
describe
• PREFIX ns: <http://example.org/ns#>
SELECT ?x
WHERE { ?x ns:knows ? y. ?y ns:name “Alice”.
?x ns:knows ?z. ?z ns: works ?v. ?v ns:name
“KIT” }
- 45 -


• Hybrid: both content and structure constraints

“shared apartment Berlin Alice”

?x ns:knows ? y. ?y ns:name “Alice”.
?x ns:knows ?z. ?z ns: works ?v.
?v ns:name “KIT”

- 46 -

Summary: data and queries in Semantic Search
NL Form- / facet- Structured Queries
Keywords
Questions based Inputs (SPARQL)
Ambiquities

Query
Semantic Search target
different group of users,
information needs, and types
of data. Query processing for
semantic search is hybrid
combination of techniques!
Data

RDF data Semi- OWL ontologies with
Structured
embedded in Structured rich, formal
RDF data
text (RDFa) RDF data semantics
Ambiquities: confidence degree, truth/trust
- 47 -

Processing hybrid graph patterns (1)
“Information about a friend of Alice, who shared an apartment with
her in Berlin and knows someone working at KIT.”

apartment shared Berlin Alice ?x ns:knows ?z. ?z ns: works ?v. ?v ns:name “KIT”

?y ns:name “Alice”. ?x ns:knows ? y
works age
trouble with bob FluidOps 34
Peter
sunset.jpg

au
Bob is a good friend
title Beautiful

th
of mine. We went to

or
Sunset Semantic
the same university, Germany

c re
Alice Search
and also shared an

at o
r or
apartment in Berlin creato

ye
th

r
au
kn

ar
in 2008. The trouble
ow

Germany 2009
s

with Bob is that he Bob
takes much better knows Thanh
wo

d
photos than I do:

e
rk

at
s

c
KIT

lo
- 48 -

Matching keyword query against text

• Retrieve documents
• Inverted list (inverted index)
– keyword  {<doc1, pos, score>,…,<doc2, pos, score, ...>, ...}
• AND-semantics: top-k join

shared Berlin Alice shared Berlin Alice

D1 D1 D1

shared = berlin = alice

shared

- 49 -

Matching structured query against structured data
• Retrieve data for triple patterns
• Index on tables
• Multiple “redundant” indexes to cover different access patterns
• Join (conjunction of triples)
• Blocking, e.g. linear merge join (required sorted input)
• Non-blocking, e.g. symmetric hash-join
• Materialized join indexes

?x ns:knows ?y. ?x ns:knows ?z.
Per1 ns:works ?v ?v ns:name “KIT”
SP-index PO-index ?z ns: works ?v. ?v ns:name “KIT”

=
= =

Per1 ns:works Ins1 Ins1 ns:name KIT
Per1 ns:works Ins1Ins1 ns:name KIT
- 50 -

Matching keyword query against structured data

• Retrieve keyword elements
• Using inverted index
– keyword  {<el1, score, ...>, <el2, score, ...>,…}
• Exploration / “Join”
• Data indexes for triple lookup
• Materialized index (paths up to graphs)
• Top-k Steiner tree search, top-k subgraph exploration
Alice Bob KIT Alice Bob KIT

↔ ↔

=
=
Alice ns:knows Bob Inst1 ns:name KIT
Bob ns:works Inst1

- 51 -

Matching structured query against text
• Offilne IE
• Online IE, i.e., “retrieve “ is as follows
• Derive keywords to retrieve relevant documents
• On-the-fly information extraction, i.e., phrase pattern matching “X
name Y”
• Retrieve extracted data for structured part
• Retrieve documents for derived text patterns, e.g. sequence,
windows, reg. exp.
?x ns:knows ?y. ?x ns:knows ?z.
?z ns: works ?v. ?v ns:name “KIT”

knows

name KIT

- 52 -

Matching structured query against text

• Index
• Inverted index for document retrieval and pattern matching
• Join index  inverted index for storing materialized joins
between keywords
• Neighborhood indexes for phrase patterns

?x ns:knows ?y. ?x ns:knows ?
?z ns: works ?v. ?v ns:name “K
KIT name
knows

name KIT

- 53 -

Query processing – main tasks
• Retrieval
– Documents , data elements, triples,
Query paths, graphs
– Inverted index,…, but also other (B+ tree)
– Index documents, triples, materialized
paths
• Join
– Different join implementations, efficiency
depends on availability of indexes
– Non-blocking join good for early result
Data reporting and for “unpredictable” Linked
Data / data streams scenario

- 54 -

Query processing – more tasks
• More complex queries: disjunction,
aggregation, grouping, analytics…
Query • Join order optimization
• Approximate
– Approximate the search space
– Approximate the results (matching, join)
• Parallelization
• Top-k
– Use only some entries in the input
streams to produce k results
Data
• Multiple sources
– Federation, routing
– On-the-fly mapping, similarity join
• Hybrid
– Join text and data
- 55 -

Query processing on the Web -
research challenges and opportunities

• Large amount of
semantic data
• Optimization,
• Data inconsistent,
parallelization
redundant, and low
quality
• Approximation
• Hybrid querying and data
• Large amount of data
management
embedded in text
• Federation, routing
• Large amount of sources
• Online schema mappings
• Large amount of links
• Similarity join
between sources

- 56 -

Ranking – problem definition

Query • Ambiguities arise when
representation is incomplete /
imprecise
• Ambiguities at the level of
• elements (content ambiguity)
• structure between elements
(structure ambiguity)

Data

Due to ambiguities in the representation of the
information needs and the underlying resources, the
results cannot be guaranteed to exactly match the
query. Ranking is the problem of determining the degree
of matching using some -notions of relevance.
58 -

Content ambiguity

works age
Peter
sunset.jpg

au
title Beautiful

t ho
of mine. We went to Sunset

r
the same university, Germany Semantic

cre
Alice Search
and also shared an

at o
r or

ye
th

r
au

kn

ar

ow
Germany 2009

s
wo

d
photos than I do:

te
rk
s

ca
KIT

lo
What is meant by “Berlin” in the query? What is meant by “KIT” in the query?
What is meant by “Berlin” in the data? What is meant by “KIT” in the data?
A city with the name Berlin? a person? A research group? a university? a location?

- 59 -

Structure ambiguity

works age
Peter
sunset.jpg

au
title Beautiful

t ho
of mine. We went to Sunset

r
the same university, Germany Semantic

cre
Alice Search
and also shared an

at o
r or

ye
th

r
au

kn

ar

ow
Germany 2009

s
wo

d
photos than I do:

te
rk
s

ca
KIT

lo
What is the connection between What is meant by “works”?
“Berlin” and “Alice”? Works at? employed?
Friend? Co-worker?

- 60 -

Ambiguity

• Ambiguities arise when data or query allow for multiple
interpretations, i.e. multiple matches
– Syntactic, e.g. works vs. works at
– Semantic, e.g. works vs. employ
• “Aboutness”, i.e., contain some elements which represent the
correct interpretation
– Ambiguities arise when matching elements of different granularities
– Does i contains the interpretation for j, given some part(s) of i
(syntactically/semantically) match j
– E.g. Berlin vs. “…we went to the same university, and also, we shared
an apartment in Berlin in 2008…”
• Strictly speaking, ranking is performed after syntactic / semantic
matching is done!

- 61 -

Features: What to use to deal with ambiguities?

What is meant by “Berlin”? What is the
connection between “Berlin” and “Alice”?
• Content features
– Frequencies of terms: d more likely to be “about” a query term k
when d more often, mentions k (probabilistic IR)
– Co-occurrences: terms K that often co-occur form a contextual
interpretation, i.e., topics (cluster hypothesis)
• Structure features
– Consider relevance at level of fields
– Linked-based popularity

- 62 -

Ranking paradigms

• Explicit relevance model
– Foundation: probability ranking principle
– Ranking results by the posterior probability (odds) of being
observed in the relevant class:
– P(w|R) varies in different approaches
• binary independence model
• Two-Poisson model
• BM25

P(D|R)
P(D/N)

P ( D | R) = ∏ P( w | R)∏ (1 − P( w | N ))
w∈D w∉D

- 63 -

Ranking paradigms

• No explicit notion of relevance: similarity between the query
and the document model
– Vector space model (cosine similarity)
– Language models (KL divergence)

Sim( q, d ) = Cos ( ( w1,d ,..., wt ,d ), ( w1, q ,..., wk ,q ))

P (t | θq )
Sim( q, d ) = −KL (θq ||θ d ) = −∑P (t | θq ) log(
t∈V P (t | θd )

- 64 -

Model construction

• How to obtain
– Relevance models?
– Weights for query / document terms?
– Language models for document / queries?

- 65 -

Content-based features

• Document statistics, e.g. • An object is more likely
– Term frequency about “Berlin” when
– Document length • it contains a relatively
• Collection statistics, e.g. high number of
– Inverse document frequency
mentions of the term
“Berlin”
– Background language models
• the number of
mentions of this term in
tf the overall collection is
wt , d = ∗idf relatively low
|d |
tf
P (t | θd ) = λ + (1 − λ) P (t | C )
|d |
- 66 -

Structure-based features

• Consider structure of objects
– Content-based features for structured objects, documents and
for general tuples

P (t | θd ) = ∑α
f ∈Fd
f P (t | θ f )

• An object is more likely about “Berlin when
• one of its (important) fields contains a relatively high
number of mentions of the term “Berlin”
- 67 -

Structure-based features (2)

• PageRank
– Link analysis algorithm
– Measuring relative importance of nodes
– Link counts as a vote of support
– The PageRank of a node recursively depends on the number
and PageRank of all nodes that link to it (incoming links)
• ObjectRank
– Types and semantics of links vary in structured data setting
– Authority transfer schema graph specifies connection strengths
– Recursively compute authority transfer data graph

An object about “Berlin” is more important than another when
• a relatively large number of objects are linked to it
- 68 -

In practice

• Many more aspects of relevance
– User profiles
– History
– Context, e.g. geo-location
– etc.
• Combination of features using Machine Learning
– Several hundred features in modern search engines
• Pre-compute static features such as PageRank/ObjectRank
• Two-phase scoring for efficiency
– Round 1: easy to compute features
– Round 2: more expensive features

- 69 -

Evaluation
Harry Halpin, Daniel Herzig, Peter Mika, Jeff Pound,
Henry Thompson, Roi Blanco, Thanh Tran Duc

Semantic Search challenge (2010/2011)

• Two tasks
– Entity Search
• Queries where the user is looking for a single real world object
• Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW
2010.
– List search (new in 2011)
• Queries where the user is looking for a class of objects
• Billion Triples Challenge 2009 dataset
• Evaluated using Amazon’s Mechanical Turk
– Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010
– Blanco et al. Repeatable and Reliable Search System
Evaluation using Crowd-Sourcing, SIGIR2011

- 71 -

Evaluation form

- 72 -

Other evaluations

• TREC Entity Track
– Related Entity Finding
• Entities related to a given entity through a particular relationship
• Retrieval over documents (ClueWeb 09 collection)
• Example: (Homepages of) airlines that fly Boeing 747
– Entity List Completion
• Given some elements of a list of entities, complete the list
• Question Answering over Linked Data
– Retrieval over specific datasets (Dbpedia and MusicBrainz)
– Full natural language questions of different forms
– Correct results defined by an equivalent SPARQL query
– Example: Give me all actors starring in Batman Begins.

- 73 -

Search interface
• Input and output functionality
– helping the user to formulate complex queries
– presenting the results in an intelligent manner
• Semantic Search brings improvements in
– Query formulation
– Snippet generation
– Suggesting related entities
– Adaptive and interactive presentation
• Presentation adapts to the kind of query and results presented
• Object results can be actionable, e.g. buy this product
– Aggregated search
• Grouping similar items, summarizing results in various ways
• Filtering (facets), possibly across different dimensions
– Task completion
• Help the user to fulfill the task by placing the query in a task context

- 75 -

Query formulation

• “Snap-to-grid”: suggest the most likely interpretation of
the query
– Given the ontology or a summary of the data
– While the user is typing or after issuing the query
– Example: Freebase suggest, TrueKnowledge

- 76 -

Enhanced results/Rich Snippets

– Use mark-up from the webpage to generate search snippets
• Originally invented at Yahoo! (SearchMonkey)
– Google, Yahoo!, Bing, Yandex now consume schema.org
markup
• Validators available from Google and Bing

- 77 -

Other result presentation tasks

• Select the most relevant resources within an
RDF document
– Penin et al. Snippet Generation for Semantic Web Search
Engines, ASWC 2010
• For each resource, rank the properties to be displayed
• Natural Language Generation (NLG)
– Verbalize, explain results

- 78 -

Aggregated search: facets

- 79 -

Aggregated search: Sig.ma

- 80 -

Related entities

Related actors
and movies

- 81 -

Adaptive presentation:
housing search

- 82 -

Resources
• Books
– Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern
Information Retrieval. ACM Press. 2011
• Survey papers
– Thanh Tran, Peter Mika. Survey of Semantic Search Approaches.
Under submission, 2012.
• Conferences and workshops
– ISWC, ESWC, WWW, SIGIR, CIKM, SemTech
– Semantic Search workshop series
– Exploiting Semantic Annotations in Information Retrieval (ESAIR)
– Entity-oriented Search (EOS) workshop
• Upcoming
– Joint Intl. Workshop on Entity-oriented and Semantic Search
(JIWES) at SIGIR 2012
– ESAIR 2012 at CIKM 2012

- 83 -

The End

• Many thanks to Thanh Tran (KIT) and members of the
SemSearch group at Yahoo! Research in Barcelona
• Contact
– pmika@yahoo-inc.com
– Internships available for PhD students (deadline in January)

- 84 -

Semantic Search overview at SSSW 2012

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Destaque

Destaque (20)

Semelhante a Semantic Search overview at SSSW 2012

Semelhante a Semantic Search overview at SSSW 2012 (20)

Mais de Peter Mika

Mais de Peter Mika (10)

Último

Último (20)

Semantic Search overview at SSSW 2012

Notas do Editor