Semantic Search
                              Peter Mika
              Senior Research Scientist
                       Yahoo! Research


With contributions from Thanh Tran (KIT)
Yahoo! serves over 700 million users in 25 countries




                             -2-
Yahoo! Research: visit us at research.yahoo.com




                            -3-
Yahoo! Research Barcelona

  •   Established January, 2006
  •   Led by Ricardo Baeza-Yates
  •   Research areas
      – Web Mining
          • content, structure, usage
      – Social Media
      – Distributed Systems
      – Semantic Search




                                        -4-
Search is really fast, without necessarily being intelligent




                              -5-
Why Semantic Search? Part I

   • Improvements in IR are harder and harder to come by
      – Machine learning using hundreds of features
          • Text-based features for matching
          • Graph-based features provide authority
      – Heavy investment in computational power, e.g. real-time
        indexing and instant search
   • Remaining challenges are not computational, but in
     modeling user cognition
      – Need a deeper understanding of the query, the content and/or
        the world at large
      – Could Watson explain why the answer is Toronto?




                                     -6-
Poorly solved information needs
   •   Multiple interpretations
        – paris hilton
   •   Long tail queries
        – george bush (and I mean the beer brewer in Arizona)
   •   Multimedia search
        – paris hilton sexy
   •   Imprecise or overly precise searches
        – jim hendler
        – pictures of strong adventures people
   •   Searches for descriptions
        – countries in africa
        – 32 year old computer scientist living in barcelona
        – reliable digital camera under 300 dollars

   Many of these queries would not be asked by users, who learned over
   time what search technology can and cannot do.


                                       -7-
Example: multiple interpretations




                             -8-
Why Semantic Search? Part II

   • The Semantic Web is now a reality
      – Large amounts of RDF data
      – Heterogeneous schemas, quality
      – Users who are not skilled in writing
        complex queries (e.g. SPARQL)
        and may not be experts in the
        domain


   • Searching data instead of or in
     addition to searching documents
      – Direct answers
      – Novel search tasks


                                    -9-
Example: direct answers in search




        Callouts on the screenshot:
        • Points of interest in Vienna, Austria
        • Faceted search for Shopping results
        • Information from the Knowledge Graph
        • Information box with content from and links to Yahoo! Travel
        • Since Aug, 2010, ‘regular’ search results are ‘Powered by Bing’




                            - 10 -
Novel search tasks

   • Aggregation of search results
      – e.g. price comparison across websites
   • Analysis and prediction
      – e.g. world temperature by 2020
   • Semantic profiling
      – Ontology-based modeling of user interests
   • Semantic log analysis
      – Linking query and navigation logs to ontologies
   • Support for complex tasks (search apps)
      – e.g. booking a vacation using a combination of services




                                   - 11 -
Contextual (pervasive, ambient) search
   • Yahoo! Connected TV: widget engine embedded into the TV
   • Yahoo! IntoNow: recognize audio and show related content




                                  - 12 -
Interactive search and task completion




                            - 13 -
Why Semantic Search? Part III

   • There is a use case
      – Consumers want to understand content
      – Publishers want consumers to understand their content
   • Semantic Web standards seem to be a good fit




                   http://en.wikipedia.org/wiki/Underpants_Gnomes
                                      - 14 -
Example: rNews

  • RDFa vocabulary for
    news articles
     – Easier to implement than
       NewsML
     – Easier to consume for
       news search and other
       readers, aggregators
  • Under development at
    the IPTC
     – March: v0.1 approved
     – Final version by Sept




                                  - 15 -
Example: Facebook’s Like and the Open Graph Protocol

   • The ‘Like’ button provides publishers with a way to promote
     their content on Facebook and build communities
      – Shows up in profiles and news feed
      – Site owners can later reach users who have liked an object
      – Facebook Graph API allows 3rd party developers to access the
        data
   • Open Graph Protocol is an RDFa-based format for describing
     the object that the user ‘Likes’




                                  - 16 -
Example: Facebook’s Open Graph Protocol

  •   RDF vocabulary to be used in conjunction with RDFa
       – Simplify the work of developers by restricting the freedom in RDFa
  •   Activities, Businesses, Groups, Organizations, People, Places,
      Products and Entertainment
  •   Only HTML <head> accepted
  •   http://opengraphprotocol.org/
 <html xmlns:og="http://opengraphprotocol.org/schema/">
 <head>
    <title>The Rock (1996)</title>
    <meta property="og:title" content="The Rock" />
    <meta property="og:type" content="movie" />
    <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
    <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> …
 </head> ...
                                   - 17 -
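A consumer of this markup only needs to read the `og:` properties out of the page head. The following is a minimal sketch using Python's standard `html.parser`; the class and function names are illustrative assumptions, not part of the Open Graph Protocol, and the sample page is a shortened copy of the example above.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..."> values that appear inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            prop = attributes.get("property") or ""
            content = attributes.get("content")
            if prop.startswith("og:") and content is not None:
                self.properties[prop] = content

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def extract_open_graph(html):
    """Return a dict of og: properties found in the head of the page."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


PAGE = """
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head>
<body><meta property="og:title" content="ignored: outside head" /></body>
</html>
"""

og = extract_open_graph(PAGE)
```

The parser skips `meta` elements outside `<head>`, mirroring the "only HTML head accepted" restriction on the slide.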
Example: schema.org

  • Agreement on a shared set of schemas for common types of
    web content
     – Bing, Google, and Yahoo! as initial supporters
     – Similar in intent to sitemaps.org (2006)
         • Use a single format to communicate the same information to all
           three search engines
  • Support for microdata
  • schema.org covers areas of interest to all search engines
     – Business listings (local), creative works (video), recipes,
       reviews
     – User defined extensions
  • Each search engine continues to develop its products


                                    - 18 -
Documentation and OWL ontology




                         - 19 -
Current state of metadata on the Web

   • 31% of webpages, 5% of domains contain some
     metadata
       – Analysis of the Bing Crawl (US crawl, January, 2012)
       – RDFa is most common format
   • By URL: 25% RDFa, 7% microdata, 9% microformat
   • By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat
       – Adoption is stronger among large publishers
   • Especially for RDFa and microdata
   •   See also
        – P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus,
          LDOW 2012
        – H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured
          Data from Two Large Web Corpora, LDOW 2012


                                         - 20 -
Exponential growth in RDFa data

     Figure: percentage of URLs with embedded metadata in various formats
       • Five-fold increase between March, 2009 and October, 2010
       • Another five-fold increase between October, 2010 and January, 2012
                                    - 21 -
Semantic Search
Semantic Search: a definition

   • Semantic search is a retrieval paradigm that
      – Makes use of the structure of the data or explicit schemas
        to understand user intent and the meaning of content
      – Exploits this understanding at some part of the search
        process
   • Web search vs. vertical/enterprise/desktop search
   • Related fields:
      – XML retrieval
      – Keyword search in databases
      – Natural Language Retrieval




                                 - 23 -
Semantics at every step of the IR process

     Figure: the IR process between the user, the IR engine and the Web
       • User query: “bla bla bla?”
       • Query interpretation: q=“bla” * 3
       • Crawling and indexing of Web documents (“bla”, “bla bla”)
       • Indexing and ranking: θ(q,d) for “bla”
       • Result presentation
                                              - 24 -
Crawling and Indexing
Data on the Web

  • Most web pages on the Web are generated from structured
    data
     – Data is stored in relational databases (typically)
     – Queried through web forms
     – Presented as tables or simply as unstructured text
  • The structure and semantics (meaning) of the data are not
    directly accessible to search engines
  • Two solutions
     – Extraction using Information Extraction (IE) techniques
       (implicit metadata)
         • Supervised vs. unsupervised methods
     – Relying on publishers to expose structured data using standard
       Semantic Web formats (explicit metadata)
         • Particularly interesting for long tail content
                                       - 26 -
Information Extraction methods

   • Natural Language Processing
   • Extraction of triples
      – Suchanek et al. YAGO: A Core of Semantic Knowledge
        Unifying WordNet and Wikipedia, WWW, 2007.
      – Wu and Weld. Autonomously Semantifying Wikipedia, CIKM
        2007.
   • Filling web forms automatically (form-filling)
      – Madhavan et al. Google's Deep-Web Crawl. VLDB 2008
   • Extraction from HTML tables
      – Cafarella et al. WebTables: Exploring the Power of Tables on
        the Web. VLDB 2008
   • Wrapper induction
      – Kushmerick et al. Wrapper Induction for Information
        Extraction. IJCAI 1997
                                    - 27 -
Semantic Web

  • Sharing data across the Web
     – Publish information in standard formats (RDF, RDFa)
     – Share the meaning using powerful, logic-based languages
       (OWL, RIF)
     – Query using standard languages and protocols (HTTP, SPARQL)
  • Two main forms of publishing
     – Linked Data
        • Data published as RDF documents linked to other RDF documents
          and/or using SPARQL end-points
        • Community effort to re-publish large public datasets (e.g. DBpedia,
          open government data)
     – RDFa
        • Data embedded inside HTML pages
        • Recommended for site owners by Yahoo, Google, Facebook
                                    - 28 -
Crawling the Semantic Web

  • Linked Data
     – Similar to HTML crawling, but the crawler needs to parse
       RDF/XML (and others) to extract URIs to be crawled
     – Semantic Sitemap/VOID descriptions
  • RDFa
     – Same as HTML crawling, but data is extracted after crawling
     – Mika et al. Investigating the Semantic Gap through Query Log
       Analysis, ISWC 2010.
  • SPARQL endpoints
     – Endpoints are not linked, need to be discovered by other
       means
     – Semantic Sitemap/VOID descriptions


                                 - 29 -
Data fusion
   •   Ontology matching
        – Widely studied in Semantic Web research, see e.g. list of
          publications at ontologymatching.org
            • Unfortunately, not much of it is applicable in a Web context due to the
              quality of ontologies
   •   Entity resolution
        – Logic-based approaches in the Semantic Web
        – Studied as record linkage in the database literature
            • Machine learning based approaches, focusing on attributes
        – Graph-based approaches (see e.g. the work of Lise Getoor) are
          applicable to RDF data
            • Improvements over only attribute based matching
   •   Blending
        – Merging objects that represent the same real world entity and
          reconciling information from multiple sources
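To make the entity resolution and blending steps concrete, here is a minimal sketch of attribute-based matching followed by merging. It is an illustration under stated assumptions, not one of the cited approaches: the records, the Jaccard similarity over value tokens, the 0.3 threshold and the "first source wins" blending rule are all hypothetical choices.

```python
def value_tokens(record):
    """Lower-cased tokens drawn from all attribute values of a record."""
    return {tok for value in record.values() for tok in str(value).lower().split()}


def jaccard(a, b):
    """Attribute-based similarity: overlap of value tokens between two records."""
    ta, tb = value_tokens(a), value_tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def blend(records):
    """Merge records of one entity; the first source to supply an attribute wins."""
    merged = {}
    for record in records:
        for key, value in record.items():
            merged.setdefault(key, value)
    return merged


def resolve(records, threshold=0.3):
    """Greedy clustering: a record joins the first cluster whose seed is similar enough."""
    clusters = []
    for record in records:
        for cluster in clusters:
            if jaccard(cluster[0], record) >= threshold:
                cluster.append(record)
                break
        else:
            clusters.append([record])
    return [blend(cluster) for cluster in clusters]


# Hypothetical records from three sources.
RECORDS = [
    {"name": "Peter Mika", "affiliation": "Yahoo! Research"},
    {"name": "Peter Mika", "city": "Barcelona"},
    {"name": "Thanh Tran", "affiliation": "KIT"},
]

entities = resolve(RECORDS)
```

A graph-based approach would additionally compare the neighbours of each record, which this attribute-only sketch ignores.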



                                          - 30 -
Data quality assessment and curation
   •   Heterogeneity, quality of data is an even larger issue
        – Quality ranges from well-curated data sets (e.g. Freebase) to
          microformats
            • In the worst of cases, the data becomes a graph of words
        – Short amounts of text: prone to mistakes in data entry or extraction
            • Example: mistake in a phone number or state code
   •   Quality assessment and data curation
        – Quality varies from data created by experts to user-generated
          content
        – Automated data validation
            • Against known-good data or using triangulation
            • Validation against the ontology or using probabilistic models
        – Data validation by trained professionals or crowdsourcing
            • Sampling data for evaluation
        – Curation based on user feedback


                                          - 31 -
Indexing

   • Search requires matching and ranking
      – Matching selects a subset of the elements to be scored
   • The goal of indexing is to speed up matching
      – Retrieval needs to be performed in milliseconds
      – Without an index, retrieval would require streaming through the
        collection
   • The type of index depends on the query model to support
      – DB-style indexing
      – IR-style indexing




                                   - 32 -
IR-style indexing

   • Index data as text
      – Create virtual documents from data
      – One virtual document per subgraph, resource or triple
          • typically: resource
   • Key differences to Text Retrieval
      – RDF data is structured
      – Minimally, queries on property values are required
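As a rough sketch of the "one virtual document per resource" idea, the snippet below groups triples by subject and keeps the tokens of each property value in a separate field, so that queries on property values remain possible. The triples, the `ex:` prefix and the tokenizer are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical triples (subject, property, object).
TRIPLES = [
    ("ex:Alice", "ex:name", "Alice"),
    ("ex:Alice", "ex:knows", "ex:Bob"),
    ("ex:Bob", "ex:name", "Bob"),
    ("ex:Bob", "ex:works", "ex:KIT"),
    ("ex:KIT", "ex:name", "KIT"),
]


def tokenize(value):
    """Lower-case a literal, or the local name of a prefixed URI, and split it.

    Simplification: literals that contain a colon are not handled.
    """
    return value.split(":")[-1].lower().split()


def build_virtual_documents(triples):
    """Return {resource: {property: [tokens]}}, one virtual document per resource."""
    docs = defaultdict(lambda: defaultdict(list))
    for subject, prop, obj in triples:
        docs[subject][prop].extend(tokenize(obj))
    return {subject: dict(fields) for subject, fields in docs.items()}


virtual_docs = build_virtual_documents(TRIPLES)
```

Each virtual document can then be handed to a text index, with the property names serving as fields.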




                                   - 33 -
Horizontal index structure

  • Two fields (indices): one for terms, one for properties
  • For each term, store the property on the same position in the
    property index
     – Positions are required even without phrase queries
  • Query engine needs to support the alignment operator
  • ✓ Dictionary is number of unique terms + number of
    properties
  • Occurrences is number of tokens * 2
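The horizontal layout can be sketched as two positional indices that are filled in lockstep: every token writes its term into one index and its property into the other at the same position. The alignment operator then keeps documents in which a term and a property share a position. The virtual documents and function names below are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical virtual documents: resource -> list of (property, value) pairs.
DOCS = {
    "ex:Bob": [("name", "Bob"), ("worksAt", "KIT")],
    "ex:KIT": [("name", "KIT"), ("location", "Karlsruhe Germany")],
}


def build_horizontal_index(docs):
    """Build a term index and a property index that share token positions."""
    term_index = defaultdict(lambda: defaultdict(set))      # term -> doc -> positions
    property_index = defaultdict(lambda: defaultdict(set))  # property -> doc -> positions
    for doc_id, pairs in docs.items():
        position = 0
        for prop, value in pairs:
            for token in value.lower().split():
                term_index[token][doc_id].add(position)
                property_index[prop][doc_id].add(position)
                position += 1
    return term_index, property_index


def aligned_match(term_index, property_index, term, prop):
    """Alignment operator: documents where term and property occur at the same position."""
    term_postings = term_index.get(term.lower(), {})
    prop_postings = property_index.get(prop, {})
    hits = []
    for doc_id, positions in term_postings.items():
        if positions & prop_postings.get(doc_id, set()):
            hits.append(doc_id)
    return sorted(hits)


term_index, property_index = build_horizontal_index(DOCS)
```

With this toy data the dictionary holds four terms plus three properties, and the five tokens produce ten occurrences, matching the size estimates on the slide.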




                                  - 34 -
Vertical index structure

  •   One field (index) per property
  •   Positions are not required
       – But useful for phrase queries
  •   Query engine needs to support fields
  •   Dictionary is number of unique terms
  •   Occurrences is number of tokens
  •   ✗ Number of fields is a problem for merging, query performance
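For comparison, a vertical layout can be sketched as one small inverted index per property. No positions are stored, and a query names the field it wants to search. The data and function names are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical virtual documents: resource -> list of (property, value) pairs.
DOCS = {
    "ex:Bob": [("name", "Bob"), ("worksAt", "KIT")],
    "ex:KIT": [("name", "KIT"), ("location", "Karlsruhe Germany")],
}


def build_vertical_index(docs):
    """Build one field (inverted index) per property: property -> term -> doc ids."""
    fields = defaultdict(lambda: defaultdict(set))
    for doc_id, pairs in docs.items():
        for prop, value in pairs:
            for token in value.lower().split():
                fields[prop][token].add(doc_id)
    return fields


def search_field(fields, prop, term):
    """Return the documents whose given property contains the term."""
    return sorted(fields.get(prop, {}).get(term.lower(), set()))


fields = build_vertical_index(DOCS)
```

The number of fields grows with the number of distinct properties (three here), which is the scaling concern noted in the last bullet.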




                                         - 35 -
Distributed indexing

   • MapReduce is ideal for building inverted indices
      – Map creates (term, {doc1}) pairs
      – Reduce collects all docs for the same term: (term, {doc1,
        doc2…})
      – Sub-indices are merged separately
          • Term-partitioned indices
   • Peter Mika. Distributed Indexing for Semantic Search,
     SemSearch 2010.
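The map and reduce steps above can be simulated in a single process. This is only a sketch of the data flow, assuming toy documents and whitespace tokenization; a real job would run on a MapReduce framework and write partitioned sub-indices.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical document collection.
DOCS = {
    "doc1": "shared apartment Berlin",
    "doc2": "Berlin Alice",
    "doc3": "Alice shared photos",
}


def map_phase(doc_id, text):
    """Map: emit one (term, {doc}) pair per distinct term in the document."""
    for term in set(text.lower().split()):
        yield term, {doc_id}


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for term, doc_set in pairs:
        groups[term].append(doc_set)
    return groups


def reduce_phase(term, doc_sets):
    """Reduce: collect all docs for the same term into one posting list."""
    return term, sorted(set().union(*doc_sets))


def build_index(docs):
    pairs = chain.from_iterable(map_phase(d, t) for d, t in docs.items())
    return dict(reduce_phase(term, sets) for term, sets in shuffle(pairs).items())


index = build_index(DOCS)
```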




                                       - 36 -
Query Processing
What is search?

   • The search problem
      – A data collection consisting of a set of items (units of retrieval)
      – Information needs expressed as queries
      – Ambiguity in the interpretation of the data and/or the queries
   • Search is the task of efficiently finding items that are relevant
     to the information need
      – Query processing mainly focuses on efficiency of matching
        whereas ranking deals with degree of matching (relevance)!




                                     - 38 -
Types of collections and query paradigms

   • Types of collections
      – Structured data with well defined schemas
      – Semi-structured data with incomplete or no schemas
      – Data that largely comprise text
      – Hybrid / embedded data
   • Units of retrieval
      – Subgraphs, triples, resources etc.
   • Query paradigms
      – Natural language texts and keywords
      – Form-based inputs
      – Formal structured queries


                                    - 39 -
Types of data models (1)

   • Textual
      – Bag-of-words
      – Represent documents, text in structured data,…, real-world
        objects (captured as structured data)
      – Lacks “structure”
         • Text structure, e.g. linguistic structure, outlines, hyperlinks etc.
         • Structure in structured data representation


                  Example text:
                  In combination with Cloud Computing technologies, promising
                  solutions for the management of `big data' have emerged.
                  Existing industry solutions are able to support complex
                  queries and analytics tasks with terabytes of data. For
                  example, using a Greenplum.

                  term (statistics):
                  combination, Cloud, Computing, Technologies, solutions,
                  management, `big data', industry, solutions, support,
                  complex, ……




                                            - 40 -
Types of data models (2)

   • Graph structure
      – Relationships in the data
          • Hyperlinks
          • Typed relationships
      – Ontology
      Figure: graph with the nodes Person, Picture and Bob and an edge
      labelled creator




                                      - 41 -
Types of data models (3)

   • Hybrid
      – RDF data embedded in text (RDFa)




                                - 42 -
Formalisms for querying semantic data (1)

  Example information need
  “Information about a friend of Alice, who shared
  an apartment with her in Berlin and knows
  someone working at KIT.”




                            - 43 -
Formalisms for querying semantic data (2)

  Example information need
  “Information about a friend of Alice, who shared
  an apartment with her in Berlin and knows
  someone working at KIT.”
  • Unstructured
     – NL
     – Keywords
                     shared      apartment   Berlin   Alice




                              - 44 -
Formalisms for querying semantic data (3)

  Example information need
  “Information about a friend of Alice, who shared
  an apartment with her in Berlin and knows
  someone working at KIT.”
  • Fully-structured
     – SPARQL: BGP, filter, optional, union, select, construct, ask,
       describe
         • PREFIX ns: <http://example.org/ns#>
           SELECT ?x
           WHERE { ?x ns:knows ?y. ?y ns:name "Alice".
                   ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" }
                                   - 45 -
Formalisms for querying semantic data (4)

   • Hybrid: both content and structure constraints




                                           “shared apartment Berlin Alice”

              ?x ns:knows ?y. ?y ns:name "Alice".
              ?x ns:knows ?z. ?z ns:works ?v.
              ?v ns:name "KIT"


                                  - 46 -
Summary: data and queries in Semantic Search
  Query (with ambiguities)
     • Keywords
     • NL Questions
     • Form- / facet-based Inputs
     • Structured Queries (SPARQL)

  Semantic search targets different groups of users, information
  needs, and types of data. Query processing for semantic search
  is a hybrid combination of techniques!

  Data (with ambiguities: confidence degree, truth/trust)
     • RDF data embedded in text (RDFa)
     • Semi-Structured RDF data
     • Structured RDF data
     • OWL ontologies with rich, formal semantics
                                        - 47 -
Processing hybrid graph patterns (1)
  Example information need
  “Information about a friend of Alice, who shared an apartment with
  her in Berlin and knows someone working at KIT.”


  Figure: the hybrid query matched against a text document and an RDF graph
    • Keyword part: apartment shared Berlin Alice
    • Structured part: ?y ns:name “Alice”. ?x ns:knows ?y.
      ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name “KIT”
    • Text document “trouble with bob”: Bob is a good friend of mine. We
      went to the same university, and also shared an apartment in Berlin
      in 2008. The trouble with Bob is that he takes much better photos
      than I do:
    • Graph nodes: sunset.jpg, Beautiful Sunset, Alice, Bob, Thanh, Peter,
      FluidOps, KIT, Semantic Search, Germany, 2009, 34
    • Edge labels: title, creator, knows, works, age, author, year, located
                                               - 48 -
Matching keyword query against text

      • Retrieve documents
         • Inverted list (inverted index)
         – keyword → {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...}
      • AND-semantics: top-k join

      Figure: the inverted lists for shared, Berlin and Alice are joined on
      the document (D1 = D1 = D1)
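A small sketch of AND-semantics over inverted lists follows. It intersects the posting lists and then keeps the k best documents by summed score. This is a simplification: a real top-k join would stop reading the lists early, which this version does not attempt. The index contents and scores are made up for illustration.

```python
import heapq

# Hypothetical inverted index: keyword -> {document: score}.
INVERTED_INDEX = {
    "shared": {"D1": 2, "D2": 3},
    "berlin": {"D1": 3, "D2": 1, "D3": 2},
    "alice": {"D1": 1, "D2": 1, "D3": 4},
}


def top_k_and(index, keywords, k):
    """Return up to k (doc, score) pairs for docs that contain every keyword."""
    postings = [index.get(word.lower(), {}) for word in keywords]
    if not postings:
        return []
    postings.sort(key=len)  # start from the shortest list to keep candidates small
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates.intersection_update(posting)
    scored = ((sum(p[doc] for p in postings), doc) for doc in candidates)
    return [(doc, score) for score, doc in heapq.nlargest(k, scored)]


results = top_k_and(INVERTED_INDEX, ["shared", "Berlin", "Alice"], k=2)
```

D3 is dropped because it is missing from the list for "shared", even though its individual scores are high.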




                                      - 49 -
Matching structured query against structured data
      •   Retrieve data for triple patterns
           • Index on tables
           • Multiple “redundant” indexes to cover different access patterns
      •   Join (conjunction of triples)
           • Blocking, e.g. linear merge join (requires sorted input)
           • Non-blocking, e.g. symmetric hash-join
           • Materialized join indexes

      Figure: evaluating ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v.
      ?v ns:name “KIT”
        • Per1 ns:works ?v is answered from the SP-index: Per1 ns:works Ins1
        • ?v ns:name “KIT” is answered from the PO-index: Ins1 ns:name KIT
        • The two results are joined on ?v
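The lookup-then-join flow can be sketched with two redundant indexes and a hash join. The triples extend the Per1/Ins1 example with hypothetical extra data, and the join is a plain (blocking) hash join rather than the symmetric variant mentioned above.

```python
from collections import defaultdict

# Hypothetical triples; Per1 and Ins1 follow the example on the slide.
TRIPLES = [
    ("Per1", "ns:works", "Ins1"),
    ("Per2", "ns:works", "Ins2"),
    ("Ins1", "ns:name", "KIT"),
    ("Ins2", "ns:name", "FluidOps"),
]


def build_indexes(triples):
    """Build two redundant indexes that cover different access patterns."""
    sp_index = defaultdict(list)  # (subject, predicate) -> objects
    po_index = defaultdict(list)  # (predicate, object) -> subjects
    for s, p, o in triples:
        sp_index[(s, p)].append(o)
        po_index[(p, o)].append(s)
    return sp_index, po_index


def hash_join(left, right, variable):
    """Join two lists of variable bindings on a shared variable."""
    table = defaultdict(list)
    for row in left:  # build phase
        table[row[variable]].append(row)
    joined = []
    for row in right:  # probe phase
        for match in table.get(row[variable], []):
            joined.append({**match, **row})
    return joined


sp_index, po_index = build_indexes(TRIPLES)
# Per1 ns:works ?v  -> SP-index lookup
left = [{"?v": o} for o in sp_index.get(("Per1", "ns:works"), [])]
# ?v ns:name "KIT"  -> PO-index lookup
right = [{"?v": s} for s in po_index.get(("ns:name", "KIT"), [])]
bindings = hash_join(left, right, "?v")
```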
                                                     - 50 -
Matching keyword query against structured data

       •     Retrieve keyword elements
              • Using inverted index
              – keyword → {<el1, score, ...>, <el2, score, ...>,…}
       •     Exploration / “Join”
              • Data indexes for triple lookup
              • Materialized index (paths up to graphs)
              • Top-k Steiner tree search, top-k subgraph exploration
           Figure: the keywords Alice, Bob and KIT are mapped to data
           elements and connected by exploration:
           Alice ns:knows Bob, Bob ns:works Inst1, Inst1 ns:name KIT
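The exploration step can be approximated with breadth-first search. The sketch below looks up one data element per keyword and unions the shortest paths from the first element to the others. This is a crude stand-in for top-k Steiner tree search, shown only to illustrate the idea; the graph and all function names are assumptions.

```python
from collections import deque

# Hypothetical data graph, following the Alice / Bob / KIT example.
TRIPLES = [
    ("Alice", "ns:knows", "Bob"),
    ("Bob", "ns:works", "Inst1"),
    ("Inst1", "ns:name", "KIT"),
    ("Bob", "ns:knows", "Thanh"),
]


def keyword_elements(triples, keyword):
    """Nodes whose label equals the keyword (stand-in for an inverted index lookup)."""
    nodes = {node for s, _, o in triples for node in (s, o)}
    return sorted(n for n in nodes if n.lower() == keyword.lower())


def neighbours(triples, node):
    """Yield (adjacent node, connecting triple), ignoring edge direction."""
    for triple in triples:
        s, _, o = triple
        if s == node:
            yield o, triple
        elif o == node:
            yield s, triple


def shortest_path(triples, start, goal):
    """Breadth-first search; returns the triples on a shortest path, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt, triple in neighbours(triples, node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [triple]))
    return None


def connect(triples, keywords):
    """Union of shortest paths from the first keyword element to all others."""
    elements = []
    for keyword in keywords:
        matches = keyword_elements(triples, keyword)
        if not matches:
            return []  # AND-semantics: every keyword must match something
        elements.append(matches[0])
    subgraph = []
    for other in elements[1:]:
        for triple in shortest_path(triples, elements[0], other) or []:
            if triple not in subgraph:
                subgraph.append(triple)
    return subgraph


answer = connect(TRIPLES, ["Alice", "Bob", "KIT"])
```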


                                                     - 51 -
Matching structured query against text
   •   Offline IE
   •   Online IE, i.e., “retrieve” works as follows
        • Derive keywords to retrieve relevant documents
        • On-the-fly information extraction, i.e., phrase pattern matching “X
            name Y”
        • Retrieve extracted data for structured part
        • Retrieve documents for derived text patterns, e.g. sequence,
            windows, reg. exp.
        Figure: the query ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v.
        ?v ns:name “KIT” is matched against text through the derived
        keywords knows, name and KIT
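A toy version of online IE is sketched below: a keyword derived from the predicate selects candidate documents, and a phrase pattern extracts bindings on the fly. The documents, the regular expressions and the predicate-to-pattern mapping are hypothetical; real systems would use an inverted index for the retrieval step and far more robust patterns.

```python
import re

# Hypothetical text collection.
DOCS = {
    "d1": "Bob works at KIT. Alice knows Bob.",
    "d2": "Peter works at FluidOps.",
    "d3": "Sunset over Berlin.",
}

# Hypothetical phrase patterns, one per predicate ("X <phrase> Y").
PATTERNS = {
    "ns:works": re.compile(r"(\w+) works at (\w+)"),
    "ns:knows": re.compile(r"(\w+) knows (\w+)"),
}


def extract(docs, predicate, obj=None):
    """Extract (subject, predicate, object) triples for one triple pattern."""
    pattern = PATTERNS[predicate]
    keyword = predicate.split(":")[-1]  # derived keyword, e.g. "works"
    triples = []
    for doc_id in sorted(docs):
        text = docs[doc_id]
        if keyword not in text:  # cheap retrieval step before pattern matching
            continue
        for subject, found in pattern.findall(text):
            if obj is None or found == obj:
                triples.append((subject, predicate, found))
    return triples


works_at_kit = extract(DOCS, "ns:works", "KIT")
knows = extract(DOCS, "ns:knows")
```

The extracted triples can then be joined with the structured part of the query in the usual way.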




                                        - 52 -
Matching structured query against text

   •     Index
          • Inverted index for document retrieval and pattern matching
          • Join index → inverted index for storing materialized joins
             between keywords
          • Neighborhood indexes for phrase patterns

          Figure: the query ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v.
          ?v ns:name “KIT” with index entries for KIT, name and knows




                                      - 53 -
Query processing – main tasks
       Figure: Query matched against Data
                      • Retrieval
                         – Documents, data elements, triples,
                           paths, graphs
                         – Inverted index,…, but also other (B+ tree)
                         – Index documents, triples, materialized
                           paths
                      • Join
                         – Different join implementations, efficiency
                           depends on availability of indexes
                         – Non-blocking join good for early result
                           reporting and for “unpredictable” Linked
                           Data / data streams scenario




                           - 54 -
Query processing – more tasks
       Figure: Query matched against Data
                      • More complex queries: disjunction,
                        aggregation, grouping, analytics…
                      • Join order optimization
                      • Approximate
                         – Approximate the search space
                         – Approximate the results (matching, join)
                      • Parallelization
                      • Top-k
                         – Use only some entries in the input
                           streams to produce k results
                      • Multiple sources
                         – Federation, routing
                         – On-the-fly mapping, similarity join
                      • Hybrid
                         – Join text and data
                           - 55 -
Query processing on the Web -
research challenges and opportunities

  Challenges
  • Large amount of semantic data
  • Data inconsistent, redundant, and low quality
  • Large amount of data embedded in text
  • Large amount of sources
  • Large amount of links between sources

  Opportunities
  • Optimization, parallelization
  • Approximation
  • Hybrid querying and data management
  • Federation, routing
  • Online schema mappings
  • Similarity join




                              - 56 -
Ranking
Ranking – problem definition

     Figure: Query matched against Data
                   • Ambiguities arise when
                     representation is incomplete /
                     imprecise
                   • Ambiguities at the level of
                      • elements (content ambiguity)
                      • structure between elements
                        (structure ambiguity)

     Due to ambiguities in the representation of the
     information needs and the underlying resources, the
     results cannot be guaranteed to exactly match the
     query. Ranking is the problem of determining the degree
     of matching using some notions of relevance.
                                - 58 -
Content ambiguity
  [Figure: example queries and data]
  Keyword query: apartment shared Berlin Alice
  Structured queries:
    ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name “KIT”
    ?y ns:name “Alice”. ?x ns:knows ?y
  Text data (“trouble with bob”): Bob is a good friend of mine. We went to
    the same university, and also shared an apartment in Berlin in 2008.
    The trouble with Bob is that he takes much better photos than I do:
  Graph data: nodes FluidOps, Peter, sunset.jpg, Beautiful Sunset, Germany,
    Semantic Search, Alice, Bob, Thanh, KIT, 2009, 34; edge labels works,
    age, author, title, creator, knows, year, located

  • What is meant by “Berlin” in the query?
    What is meant by “Berlin” in the data?
    A city with the name Berlin? A person?
  • What is meant by “KIT” in the query?
    What is meant by “KIT” in the data?
    A research group? A university? A location?




                                               - 59 -
Structure ambiguity
  [Figure: example queries and data]
  Keyword query: apartment shared Berlin Alice
  Structured queries:
    ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name “KIT”
    ?y ns:name “Alice”. ?x ns:knows ?y
  Text data (“trouble with bob”): Bob is a good friend of mine. We went to
    the same university, and also shared an apartment in Berlin in 2008.
    The trouble with Bob is that he takes much better photos than I do:
  Graph data: nodes FluidOps, Peter, sunset.jpg, Beautiful Sunset, Germany,
    Semantic Search, Alice, Bob, Thanh, KIT, 2009, 34; edge labels works,
    age, author, title, creator, knows, year, located

  • What is the connection between “Berlin” and “Alice”?
    Friend? Co-worker?
  • What is meant by “works”?
    Works at? Employed?



                                               - 60 -
Ambiguity

  •   Ambiguities arise when data or query allow for multiple
      interpretations, i.e. multiple matches
       – Syntactic, e.g. works vs. works at
       – Semantic, e.g. works vs. employ
  •   “Aboutness”, i.e., an item contains some elements which represent the
      correct interpretation
       – Ambiguities arise when matching elements of different granularities
       – Does i contain the interpretation for j, given that some part(s) of i
         (syntactically/semantically) match j?
       – E.g. Berlin vs. “…we went to the same university, and also, we shared
         an apartment in Berlin in 2008…”
  •   Strictly speaking, ranking is performed after syntactic / semantic
      matching is done!



                                          - 61 -
Features: What to use to deal with ambiguities?

   What is meant by “Berlin”? What is the
   connection between “Berlin” and “Alice”?
  • Content features
     – Frequencies of terms: d is more likely to be “about” a query term k
       when d mentions k more often (probabilistic IR)
     – Co-occurrences: terms K that often co-occur form a contextual
       interpretation, i.e., topics (cluster hypothesis)
  • Structure features
     – Consider relevance at level of fields
     – Link-based popularity



                                   - 62 -
Ranking paradigms

  • Explicit relevance model
     – Foundation: probability ranking principle
     – Ranking results by the posterior probability (odds) of being
       observed in the relevant class:

         $\frac{P(D \mid R)}{P(D \mid N)}$

         $P(D \mid R) = \prod_{w \in D} P(w \mid R) \prod_{w \notin D} (1 - P(w \mid R))$

     – P(w|R) varies in different approaches
         • binary independence model
         • Two-Poisson model
         • BM25
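
A minimal sketch of how the odds P(D|R) / P(D|N) could be computed under the binary independence assumption, where each vocabulary term is either present in or absent from the document. The function name `bim_log_odds` and the probability tables `p_rel` and `p_non` are hypothetical illustration values, not estimates from any real collection.

```python
import math

def bim_log_odds(doc_terms, vocabulary, p_rel, p_non):
    """Log of P(D|R) / P(D|N) under the binary independence model.

    p_rel[w] = P(w|R) and p_non[w] = P(w|N) are assumed to be given
    and strictly between 0 and 1.
    """
    present = set(doc_terms)
    score = 0.0
    for w in vocabulary:
        if w in present:
            # Term occurs in the document.
            score += math.log(p_rel[w] / p_non[w])
        else:
            # Term is absent from the document.
            score += math.log((1.0 - p_rel[w]) / (1.0 - p_non[w]))
    return score

# Hypothetical probabilities for a three-term vocabulary.
vocabulary = ["berlin", "apartment", "photo"]
p_rel = {"berlin": 0.8, "apartment": 0.6, "photo": 0.3}
p_non = {"berlin": 0.2, "apartment": 0.3, "photo": 0.3}

doc_a = ["berlin", "apartment"]
doc_b = ["photo"]
score_a = bim_log_odds(doc_a, vocabulary, p_rel, p_non)
score_b = bim_log_odds(doc_b, vocabulary, p_rel, p_non)
```

Working in log space turns the products into sums, which avoids numerical underflow for large vocabularies. With these made-up values, `doc_a` scores higher than `doc_b` because it contains the two terms that are more likely in the relevant class.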

                                  - 63 -
Ranking paradigms

  • No explicit notion of relevance: similarity between the query
    and the document model
      – Vector space model (cosine similarity)
      – Language models (KL divergence)




   $\mathrm{Sim}(q, d) = \cos\big((w_{1,d}, \ldots, w_{t,d}), (w_{1,q}, \ldots, w_{t,q})\big)$

   $\mathrm{Sim}(q, d) = -\mathrm{KL}(\theta_q \,\|\, \theta_d) = -\sum_{t \in V} P(t \mid \theta_q) \log \frac{P(t \mid \theta_q)}{P(t \mid \theta_d)}$
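
The two similarity functions can be sketched as follows. This is an illustration under stated assumptions: the weight vectors and the term distributions are hand-picked toy values, the function names are hypothetical, and the document language model is assumed to be smoothed already so that it is non-zero wherever the query model is non-zero.

```python
import math

def cosine_similarity(q_weights, d_weights):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(q * d for q, d in zip(q_weights, d_weights))
    norm_q = math.sqrt(sum(q * q for q in q_weights))
    norm_d = math.sqrt(sum(d * d for d in d_weights))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)

def negative_kl(theta_q, theta_d):
    """-KL(theta_q || theta_d) over a shared vocabulary.

    Both arguments map terms to probabilities. Terms with zero query
    probability contribute nothing to the sum and are skipped.
    """
    return -sum(
        p_q * math.log(p_q / theta_d[t])
        for t, p_q in theta_q.items()
        if p_q > 0.0
    )

# Toy vocabulary order for the vectors: berlin, apartment, photo.
query_vec = [1.0, 1.0, 0.0]
doc_vec = [2.0, 1.0, 1.0]
cos_score = cosine_similarity(query_vec, doc_vec)

theta_q = {"berlin": 0.5, "apartment": 0.5, "photo": 0.0}
theta_d1 = {"berlin": 0.5, "apartment": 0.4, "photo": 0.1}
theta_d2 = {"berlin": 0.1, "apartment": 0.1, "photo": 0.8}
kl_score_1 = negative_kl(theta_q, theta_d1)
kl_score_2 = negative_kl(theta_q, theta_d2)
```

Both functions return larger values for better matches: the negative KL divergence is at most zero and reaches zero only when the two distributions agree on every term the query model uses.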


                                     - 64 -
Model construction

   • How to obtain
      – Relevance models?
      – Weights for query / document terms?
      – Language models for documents / queries?




                                 - 65 -
Content-based features

   • Document statistics, e.g.
      – Term frequency
      – Document length
   • Collection statistics, e.g.
      – Inverse document frequency
      – Background language models

      $w_{t,d} = \frac{tf}{|d|} \cdot idf$

      $P(t \mid \theta_d) = \lambda \frac{tf}{|d|} + (1 - \lambda) P(t \mid C)$

   • An object is more likely about “Berlin” when
      • it contains a relatively high number of mentions of the term
        “Berlin”
      • the number of mentions of this term in the overall collection is
        relatively low
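
A small sketch of the two weighting formulas on this slide. The slide does not define idf, so `log(N / df)` is used here as one common choice (an assumption). The mixing weight `lam=0.5`, the toy collection and the function names are likewise hypothetical.

```python
import math
from collections import Counter

def tf_idf_weight(term, doc, collection):
    """w_{t,d} = (tf / |d|) * idf, with idf = log(N / df) as an assumed definition."""
    tf = Counter(doc)[term]
    df = sum(1 for other in collection if term in other)
    if tf == 0 or df == 0:
        return 0.0
    return (tf / len(doc)) * math.log(len(collection) / df)

def smoothed_prob(term, doc, collection, lam=0.5):
    """P(t|theta_d) = lam * tf/|d| + (1 - lam) * P(t|C)."""
    tf = Counter(doc)[term]
    collection_tf = sum(Counter(other)[term] for other in collection)
    collection_len = sum(len(other) for other in collection)
    p_background = collection_tf / collection_len
    return lam * tf / len(doc) + (1.0 - lam) * p_background

# Toy collection of three tokenized documents.
collection = [
    ["apartment", "in", "berlin", "berlin"],
    ["photos", "of", "a", "sunset"],
    ["an", "apartment", "in", "paris"],
]
w_berlin = tf_idf_weight("berlin", collection[0], collection)
w_in = tf_idf_weight("in", collection[0], collection)
p_seen = smoothed_prob("berlin", collection[0], collection)
p_unseen = smoothed_prob("sunset", collection[0], collection)
```

In the first document, "berlin" is frequent locally and rare in the collection, so it receives a higher weight than "in". The background model gives "sunset" a small non-zero probability even though the document never mentions it.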
                                 - 66 -
Structure-based features

   • Consider structure of objects
      – Content-based features for structured objects, documents and
        for general tuples




                $P(t \mid \theta_d) = \sum_{f \in F_d} \alpha_f P(t \mid \theta_f)$

   • An object is more likely about “Berlin” when
      • one of its (important) fields contains a relatively high
      number of mentions of the term “Berlin”
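
The field mixture can be sketched as below. The field names, the weights in `alphas` and the two example objects are hypothetical, and each field model is a plain maximum-likelihood estimate without smoothing, which a real system would most likely add.

```python
from collections import Counter

def field_mixture_prob(term, fields, alphas):
    """P(t|theta_d) = sum over fields f of alpha_f * P(t|theta_f).

    Each P(t|theta_f) is an unsmoothed relative frequency here;
    empty fields contribute zero.
    """
    prob = 0.0
    for name, tokens in fields.items():
        if tokens:
            prob += alphas[name] * Counter(tokens)[term] / len(tokens)
    return prob

# Hand-picked field weights that sum to 1 (title counts more than body).
alphas = {"title": 0.7, "body": 0.3}
obj_a = {"title": ["berlin", "apartment"], "body": ["nice", "flat", "near", "park"]}
obj_b = {"title": ["sunset", "photo"], "body": ["taken", "in", "berlin", "germany"]}
p_a = field_mixture_prob("berlin", obj_a, alphas)
p_b = field_mixture_prob("berlin", obj_b, alphas)
```

Because the title carries the larger weight, the object that mentions "berlin" in its title ends up more likely to be about Berlin than the one that mentions it only in the body.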
                                    - 67 -
Structure-based features (2)

   • PageRank
      – Link analysis algorithm
      – Measuring relative importance of nodes
      – Link counts as a vote of support
      – The PageRank of a node recursively depends on the number
        and PageRank of all nodes that link to it (incoming links)
   • ObjectRank
      – Types and semantics of links vary in structured data setting
      – Authority transfer schema graph specifies connection strengths
      – Recursively compute authority transfer data graph

 An object about “Berlin” is more important than another when
    • a relatively large number of objects link to it
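
The recursive definition of PageRank can be approximated by power iteration, sketched below. The damping factor of 0.85, the fixed iteration count, the handling of nodes without outgoing links and the toy graph are all assumptions for illustration. ObjectRank's typed authority transfer is not modeled here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {node: [nodes it links to]}.

    Nodes without outgoing links spread their score evenly over all nodes.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by nodes that have no outgoing links.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node passes an equal share of its score along every link.
        for node, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank

# Toy link graph loosely based on the running example.
links = {
    "Alice": ["Berlin", "Bob"],
    "Bob": ["Berlin"],
    "Thanh": ["Berlin", "KIT"],
    "Berlin": [],
    "KIT": [],
}
ranks = pagerank(links)
```

In this toy graph "Berlin" receives links from three nodes and ends up with the highest score, which matches the intuition on the slide that a link counts as a vote of support.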
                                   - 68 -
In practice

   • Many more aspects of relevance
      – User profiles
      – History
      – Context, e.g. geo-location
      – etc.
   • Combination of features using Machine Learning
      – Several hundred features in modern search engines
   • Pre-compute static features such as PageRank/ObjectRank
   • Two-phase scoring for efficiency
      – Round 1: easy to compute features
      – Round 2: more expensive features
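
Two-phase scoring can be sketched as a cheap first pass followed by re-ranking of only the top candidates. The function name, the cut-off `keep` and the example scores are hypothetical; the "static" value stands in for a precomputed feature and the "costly" value for a feature that would be expensive to compute at query time.

```python
def two_phase_rank(candidates, cheap_score, expensive_score, keep=2):
    """Rank with a cheap scorer, then re-rank only the top `keep` results."""
    first_pass = sorted(candidates, key=cheap_score, reverse=True)
    head, tail = first_pass[:keep], first_pass[keep:]
    # Only the head is scored with the expensive function.
    reranked = sorted(head, key=expensive_score, reverse=True)
    return reranked + tail

# Toy documents with a precomputed static score and a costly feature.
docs = [
    {"id": "d1", "static": 0.9, "costly": 0.2},
    {"id": "d2", "static": 0.8, "costly": 0.7},
    {"id": "d3", "static": 0.1, "costly": 0.99},
]
ranking = two_phase_rank(
    docs,
    cheap_score=lambda d: d["static"],
    expensive_score=lambda d: d["static"] + d["costly"],
)
ranked_ids = [d["id"] for d in ranking]
```

The example also shows the trade-off: `d3` would score well on the expensive feature but never reaches the second round, because the first round filters it out.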


                                     - 69 -
Evaluation
Harry Halpin, Daniel Herzig, Peter Mika, Jeff Pound,
Henry Thompson, Roi Blanco, Thanh Tran Duc
Semantic Search challenge (2010/2011)

   • Two tasks
      – Entity Search
          • Queries where the user is looking for a single real world object
          • Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW
            2010.
      – List search (new in 2011)
          • Queries where the user is looking for a class of objects
   • Billion Triples Challenge 2009 dataset
   • Evaluated using Amazon’s Mechanical Turk
      – Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010
      – Blanco et al. Repeatable and Reliable Search System
        Evaluation using Crowd-Sourcing, SIGIR 2011


                                       - 71 -
Evaluation form




                  - 72 -
Other evaluations

   • TREC Entity Track
      – Related Entity Finding
          • Entities related to a given entity through a particular relationship
          • Retrieval over documents (ClueWeb 09 collection)
          • Example: (Homepages of) airlines that fly Boeing 747
      – Entity List Completion
          • Given some elements of a list of entities, complete the list
   • Question Answering over Linked Data
      – Retrieval over specific datasets (DBpedia and MusicBrainz)
      – Full natural language questions of different forms
      – Correct results defined by an equivalent SPARQL query
      – Example: Give me all actors starring in Batman Begins.

                                        - 73 -
Search interface
Search interface
   •   Input and output functionality
        – helping the user to formulate complex queries
        – presenting the results in an intelligent manner
   •   Semantic Search brings improvements in
        – Query formulation
        – Snippet generation
        – Suggesting related entities
        – Adaptive and interactive presentation
            • Presentation adapts to the kind of query and results presented
            • Object results can be actionable, e.g. buy this product
        – Aggregated search
            • Grouping similar items, summarizing results in various ways
            • Filtering (facets), possibly across different dimensions
        – Task completion
            • Help the user to fulfill the task by placing the query in a task context


                                           - 75 -
Query formulation

   • “Snap-to-grid”: suggest the most likely interpretation of
     the query
      – Given the ontology or a summary of the data
      – While the user is typing or after issuing the query
      – Example: Freebase suggest, TrueKnowledge




                                 - 76 -
Enhanced results/Rich Snippets

     – Use mark-up from the webpage to generate search snippets
        • Originally invented at Yahoo! (SearchMonkey)
     – Google, Yahoo!, Bing, Yandex now consume schema.org
       markup
        • Validators available from Google and Bing




                                   - 77 -
Other result presentation tasks

   • Select the most relevant resources within an
     RDF document
      – Penin et al. Snippet Generation for Semantic Web Search
        Engines, ASWC 2010
   • For each resource, rank the properties to be displayed
   • Natural Language Generation (NLG)
      – Verbalize, explain results




                                     - 78 -
Aggregated search: facets




                            - 79 -
Aggregated search: Sig.ma




                            - 80 -
Related entities




                   Related actors
                   and movies




                         - 81 -
Adaptive presentation:
housing search




                         - 82 -
Resources
  •   Books
       – Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern
         Information Retrieval. ACM Press. 2011
  •   Survey papers
       – Thanh Tran, Peter Mika. Survey of Semantic Search Approaches.
         Under submission, 2012.
  •   Conferences and workshops
       – ISWC, ESWC, WWW, SIGIR, CIKM, SemTech
       – Semantic Search workshop series
       – Exploiting Semantic Annotations in Information Retrieval (ESAIR)
       – Entity-oriented Search (EOS) workshop
  •   Upcoming
       – Joint Intl. Workshop on Entity-oriented and Semantic Search
         (JIWES) at SIGIR 2012
       – ESAIR 2012 at CIKM 2012


                                     - 83 -
The End

  • Many thanks to Thanh Tran (KIT) and members of the
    SemSearch group at Yahoo! Research in Barcelona
  • Contact
     – pmika@yahoo-inc.com
     – Internships available for PhD students (deadline in January)




                                  - 84 -

Mais conteúdo relacionado

Mais procurados

Understanding Queries through Entities
Understanding Queries through EntitiesUnderstanding Queries through Entities
Understanding Queries through EntitiesPeter Mika
 
Related Entity Finding on the Web
Related Entity Finding on the WebRelated Entity Finding on the Web
Related Entity Finding on the WebPeter Mika
 
Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Thanh Tran
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialBarbara Starr
 
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Bradley Allen
 
From Queries to Answers in the Web
From Queries to Answers in the WebFrom Queries to Answers in the Web
From Queries to Answers in the WebRoi Blanco
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Roi Blanco
 
Smoke Signals and Social Signals: A look at the patents and papers
Smoke Signals and Social Signals: A look at the patents and papersSmoke Signals and Social Signals: A look at the patents and papers
Smoke Signals and Social Signals: A look at the patents and papersBill Slawski
 
Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationTrey Grainger
 
Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Roi Blanco
 
Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011sssw2011
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge GraphTrey Grainger
 
Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and futureRoi Blanco
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataRoi Blanco
 
Large-Scale Semantic Search
Large-Scale Semantic SearchLarge-Scale Semantic Search
Large-Scale Semantic SearchRoi Blanco
 
Relational Navigation Brings Social Computing and Semantic Technology Computi...
Relational Navigation Brings Social Computing and Semantic Technology Computi...Relational Navigation Brings Social Computing and Semantic Technology Computi...
Relational Navigation Brings Social Computing and Semantic Technology Computi...Bradley Allen
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Trey Grainger
 
Search and social patents for 2012 and beyond
Search and social patents for 2012 and beyondSearch and social patents for 2012 and beyond
Search and social patents for 2012 and beyondBill Slawski
 
Semantic Search
Semantic SearchSemantic Search
Semantic Searchsssw2012
 

Mais procurados (20)

Understanding Queries through Entities
Understanding Queries through EntitiesUnderstanding Queries through Entities
Understanding Queries through Entities
 
Related Entity Finding on the Web
Related Entity Finding on the WebRelated Entity Finding on the Web
Related Entity Finding on the Web
 
Semantic search
Semantic searchSemantic search
Semantic search
 
Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012 Semantic Search Tutorial at SemTech 2012
Semantic Search Tutorial at SemTech 2012
 
Semtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorialSemtech bizsemanticsearchtutorial
Semtech bizsemanticsearchtutorial
 
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)Multimedia Data Navigation and the Semantic Web (SemTech 2006)
Multimedia Data Navigation and the Semantic Web (SemTech 2006)
 
From Queries to Answers in the Web
From Queries to Answers in the WebFrom Queries to Answers in the Web
From Queries to Answers in the Web
 
Mining Web content for Enhanced Search
Mining Web content for Enhanced Search Mining Web content for Enhanced Search
Mining Web content for Enhanced Search
 
Smoke Signals and Social Signals: A look at the patents and papers
Smoke Signals and Social Signals: A look at the patents and papersSmoke Signals and Social Signals: A look at the patents and papers
Smoke Signals and Social Signals: A look at the patents and papers
 
Reflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital TransformationReflected Intelligence: Real world AI in Digital Transformation
Reflected Intelligence: Real world AI in Digital Transformation
 
Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations Beyond document retrieval using semantic annotations
Beyond document retrieval using semantic annotations
 
Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011Peter Mika's Presentation at SSSW 2011
Peter Mika's Presentation at SSSW 2011
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge Graph
 
Searching over the past, present and future
Searching over the past, present and futureSearching over the past, present and future
Searching over the past, present and future
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Large-Scale Semantic Search
Large-Scale Semantic SearchLarge-Scale Semantic Search
Large-Scale Semantic Search
 
Relational Navigation Brings Social Computing and Semantic Technology Computi...
Relational Navigation Brings Social Computing and Semantic Technology Computi...Relational Navigation Brings Social Computing and Semantic Technology Computi...
Relational Navigation Brings Social Computing and Semantic Technology Computi...
 
Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)Natural Language Search with Knowledge Graphs (Haystack 2019)
Natural Language Search with Knowledge Graphs (Haystack 2019)
 
Search and social patents for 2012 and beyond
Search and social patents for 2012 and beyondSearch and social patents for 2012 and beyond
Search and social patents for 2012 and beyond
 
Semantic Search
Semantic SearchSemantic Search
Semantic Search
 

Destaque

Acuario de Gijón
 Acuario de Gijón Acuario de Gijón
Acuario de GijónVillafria
 
Trabajo historia fernando b2 d astérix
Trabajo historia fernando b2 d astérixTrabajo historia fernando b2 d astérix
Trabajo historia fernando b2 d astérixFernando_2D
 
Gordon Biersch Final Book
Gordon Biersch Final BookGordon Biersch Final Book
Gordon Biersch Final BookRamonHernandez
 
Thông tư 26-2011-BTNMT - Quy định chi tiết một số điều của Nghị định số 29 về...
Thông tư 26-2011-BTNMT - Quy định chi tiết một số điều của Nghị định số 29 về...Thông tư 26-2011-BTNMT - Quy định chi tiết một số điều của Nghị định số 29 về...
Thông tư 26-2011-BTNMT - Quy định chi tiết một số điều của Nghị định số 29 về...Công ty môi trường Newtech Co
 
Ventas rápidas y seguras movilizando su fuerza de ventas - Sergio Aquino
Ventas rápidas y seguras movilizando su fuerza de ventas - Sergio AquinoVentas rápidas y seguras movilizando su fuerza de ventas - Sergio Aquino
Ventas rápidas y seguras movilizando su fuerza de ventas - Sergio AquinoGeneXus
 
Gesundes Lingenau - Herbst 2012
Gesundes Lingenau - Herbst 2012Gesundes Lingenau - Herbst 2012
Gesundes Lingenau - Herbst 2012gemeindelingenau
 
Anukret 192-on-adjustment-of-customs-duty-and-excise-tax
Anukret 192-on-adjustment-of-customs-duty-and-excise-taxAnukret 192-on-adjustment-of-customs-duty-and-excise-tax
Anukret 192-on-adjustment-of-customs-duty-and-excise-taxSAM SEYLA HUN (DA)
 
Raúl Siles y José A. Guasch - Seguridad Web de aplicaciones basadas en DNI-e ...
Raúl Siles y José A. Guasch - Seguridad Web de aplicaciones basadas en DNI-e ...Raúl Siles y José A. Guasch - Seguridad Web de aplicaciones basadas en DNI-e ...
Raúl Siles y José A. Guasch - Seguridad Web de aplicaciones basadas en DNI-e ...RootedCON
 
Curso Básico de Redes LAN
Curso Básico de Redes LANCurso Básico de Redes LAN
Curso Básico de Redes LANAlexandra6924
 
Export consulting for IT&C companies
Export consulting for IT&C companiesExport consulting for IT&C companies
Export consulting for IT&C companiesAilanthus Advance SL
 
Informe Cuatrecasas & ManpowerGroup
Informe Cuatrecasas & ManpowerGroupInforme Cuatrecasas & ManpowerGroup
Informe Cuatrecasas & ManpowerGroupSergi Serrano
 
Manual instruções Koos Jané
Manual instruções Koos JanéManual instruções Koos Jané
Manual instruções Koos JanéViver Qualidade
 
TecnologiaycomunicacionTURISMO
TecnologiaycomunicacionTURISMOTecnologiaycomunicacionTURISMO
TecnologiaycomunicacionTURISMOrobr2702
 
Bolormaa Terbish_Der ''Mongolenfleck''_02.Apr.2014
Bolormaa Terbish_Der ''Mongolenfleck''_02.Apr.2014Bolormaa Terbish_Der ''Mongolenfleck''_02.Apr.2014
Bolormaa Terbish_Der ''Mongolenfleck''_02.Apr.2014Terbish Bolormaa
 
FOLLETO CURSO VIRTUAL
FOLLETO CURSO VIRTUALFOLLETO CURSO VIRTUAL
FOLLETO CURSO VIRTUALcamilamongui
 
YO RECICLO, PROYECTO DE AULA AMBIENTAL , UCEVA 2016
YO RECICLO, PROYECTO DE AULA AMBIENTAL , UCEVA 2016YO RECICLO, PROYECTO DE AULA AMBIENTAL , UCEVA 2016
YO RECICLO, PROYECTO DE AULA AMBIENTAL , UCEVA 2016Daisy Miranda
 

Destaque (20)

Acuario de Gijón
 Acuario de Gijón Acuario de Gijón
Acuario de Gijón
 
Trabajo historia fernando b2 d astérix
Trabajo historia fernando b2 d astérixTrabajo historia fernando b2 d astérix
Trabajo historia fernando b2 d astérix
 
Gordon Biersch Final Book
Gordon Biersch Final BookGordon Biersch Final Book
Gordon Biersch Final Book
 
Thông tư 26-2011-BTNMT - Quy định chi tiết một số điều của Nghị định số 29 về...
Thông tư 26-2011-BTNMT - Quy định chi tiết một số điều của Nghị định số 29 về...Thông tư 26-2011-BTNMT - Quy định chi tiết một số điều của Nghị định số 29 về...
Thông tư 26-2011-BTNMT - Quy định chi tiết một số điều của Nghị định số 29 về...
 
Ventas rápidas y seguras movilizando su fuerza de ventas - Sergio Aquino
Ventas rápidas y seguras movilizando su fuerza de ventas - Sergio AquinoVentas rápidas y seguras movilizando su fuerza de ventas - Sergio Aquino
Ventas rápidas y seguras movilizando su fuerza de ventas - Sergio Aquino
 
Gesundes Lingenau - Herbst 2012
Gesundes Lingenau - Herbst 2012Gesundes Lingenau - Herbst 2012
Gesundes Lingenau - Herbst 2012
 
Anukret 192-on-adjustment-of-customs-duty-and-excise-tax
Anukret 192-on-adjustment-of-customs-duty-and-excise-taxAnukret 192-on-adjustment-of-customs-duty-and-excise-tax
Anukret 192-on-adjustment-of-customs-duty-and-excise-tax
 
Raúl Siles y José A. Guasch - Seguridad Web de aplicaciones basadas en DNI-e ...
Raúl Siles y José A. Guasch - Seguridad Web de aplicaciones basadas en DNI-e ...Raúl Siles y José A. Guasch - Seguridad Web de aplicaciones basadas en DNI-e ...
Raúl Siles y José A. Guasch - Seguridad Web de aplicaciones basadas en DNI-e ...
 
Cadeira burigoto
Cadeira burigotoCadeira burigoto
Cadeira burigoto
 
Crónica asamblea región 2
Crónica asamblea región 2Crónica asamblea región 2
Crónica asamblea región 2
 
Curso Básico de Redes LAN
Curso Básico de Redes LANCurso Básico de Redes LAN
Curso Básico de Redes LAN
 
Export consulting for IT&C companies
Export consulting for IT&C companiesExport consulting for IT&C companies
Export consulting for IT&C companies
 
Informe Cuatrecasas & ManpowerGroup
Informe Cuatrecasas & ManpowerGroupInforme Cuatrecasas & ManpowerGroup
Informe Cuatrecasas & ManpowerGroup
 
Manual instruções Koos Jané
Manual instruções Koos JanéManual instruções Koos Jané
Manual instruções Koos Jané
 
TecnologiaycomunicacionTURISMO
TecnologiaycomunicacionTURISMOTecnologiaycomunicacionTURISMO
TecnologiaycomunicacionTURISMO
 
Bolormaa Terbish_Der ''Mongolenfleck''_02.Apr.2014
Bolormaa Terbish_Der ''Mongolenfleck''_02.Apr.2014Bolormaa Terbish_Der ''Mongolenfleck''_02.Apr.2014
Bolormaa Terbish_Der ''Mongolenfleck''_02.Apr.2014
 
FOLLETO CURSO VIRTUAL
FOLLETO CURSO VIRTUALFOLLETO CURSO VIRTUAL
FOLLETO CURSO VIRTUAL
 
YO RECICLO, PROYECTO DE AULA AMBIENTAL , UCEVA 2016
YO RECICLO, PROYECTO DE AULA AMBIENTAL , UCEVA 2016YO RECICLO, PROYECTO DE AULA AMBIENTAL , UCEVA 2016
YO RECICLO, PROYECTO DE AULA AMBIENTAL , UCEVA 2016
 
Fcn assignment
Fcn assignmentFcn assignment
Fcn assignment
 
Mi Puerto Rico
Mi Puerto RicoMi Puerto Rico
Mi Puerto Rico
 

Semelhante a Semantic Search overview at SSSW 2012

Питер Мика "Making the web searchable"
Питер Мика "Making the web searchable"Питер Мика "Making the web searchable"
Питер Мика "Making the web searchable"Yandex
 
Web search engines and search technology
Web search engines and search technologyWeb search engines and search technology
Web search engines and search technologyStefanos Anastasiadis
 
Contextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationContextual Computing: Laying a Global Data Foundation
Contextual Computing: Laying a Global Data FoundationRichard Wallis
 
Social Media Data Collection & Analysis
Social Media Data Collection & AnalysisSocial Media Data Collection & Analysis
Social Media Data Collection & AnalysisScott Sanders
 
Contextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of EntitiesContextual Computing - Knowledge Graphs & Web of Entities
Contextual Computing - Knowledge Graphs & Web of EntitiesRichard Wallis
 
Three Linked Data choices for Libraries
Semantic Search overview at SSSW 2012

  • 1. Semantic Search Peter Mika Senior Research Scientist Yahoo! Research With contributions from Thanh Tran (KIT)
  • 2. Yahoo! serves over 700 million users in 25 countries -2-
  • 3. Yahoo! Research: visit us at research.yahoo.com -3-
  • 4. Yahoo! Research Barcelona • Established January, 2006 • Led by Ricardo Baeza-Yates • Research areas – Web Mining • content, structure, usage – Social Media – Distributed Systems – Semantic Search -4-
  • 5. Search is really fast, without necessarily being intelligent -5-
  • 6. Why Semantic Search? Part I • Improvements in IR are harder and harder to come by – Machine learning using hundreds of features • Text-based features for matching • Graph-based features provide authority – Heavy investment in computational power, e.g. real-time indexing and instant search • Remaining challenges are not computational, but in modeling user cognition – Need a deeper understanding of the query, the content and/or the world at large – Could Watson explain why the answer is Toronto? -6-
  • 7. Poorly solved information needs • Multiple interpretations – paris hilton • Long tail queries – george bush (and I mean the beer brewer in Arizona) • Multimedia search – paris hilton sexy • Imprecise or overly precise searches – jim hendler – pictures of strong adventures people • Searches for descriptions – countries in africa – 32 year old computer scientist living in barcelona – reliable digital camera under 300 dollars • Side note: many of these queries would not be asked by users, who learned over time what search technology can and cannot do. -7-
  • 9. Why Semantic Search? Part II • The Semantic Web is now a reality – Large amounts of RDF data – Heterogeneous schemas, quality – Users who are not skilled in writing complex queries (e.g. SPARQL) and may not be experts in the domain • Searching data instead of, or in addition to, searching documents – Direct answers – Novel search tasks -9-
  • 10. Example: direct answers in search • Points of interest in Vienna, Austria • Information from the Knowledge Graph • Faceted search for Shopping results • Information box with content from and links to Yahoo! Travel • Since Aug, 2010, ‘regular’ search results are ‘Powered by Bing’ - 10 -
  • 11. Novel search tasks • Aggregation of search results – e.g. price comparison across websites • Analysis and prediction – e.g. world temperature by 2020 • Semantic profiling – Ontology-based modeling of user interests • Semantic log analysis – Linking query and navigation logs to ontologies • Support for complex tasks (search apps) – e.g. booking a vacation using a combination of services - 11 -
  • 12. Contextual (pervasive, ambient) search • Yahoo! Connected TV: widget engine embedded into the TV • Yahoo! IntoNow: recognize audio and show related content - 12 -
  • 13. Interactive search and task completion - 13 -
  • 14. Why Semantic Search? Part III • There is a use case – Consumers want to understand content – Publishers want consumers to understand their content • Semantic Web standards seem to be a good fit http://en.wikipedia.org/wiki/Underpants_Gnomes - 14 -
  • 15. Example: rNews • RDFa vocabulary for news articles – Easier to implement than NewsML – Easier to consume for news search and other readers, aggregators • Under development at the IPTC – March: v0.1 approved – Final version by Sept - 15 -
  • 16. Example: Facebook’s Like and the Open Graph Protocol • The ‘Like’ button provides publishers with a way to promote their content on Facebook and build communities – Shows up in profiles and news feed – Site owners can later reach users who have liked an object – Facebook Graph API allows 3rd party developers to access the data • Open Graph Protocol is an RDFa-based format for describing the object that the user ‘Likes’ - 16 -
  • 17. Example: Facebook’s Open Graph Protocol • RDF vocabulary to be used in conjunction with RDFa – Simplify the work of developers by restricting the freedom in RDFa • Activities, Businesses, Groups, Organizations, People, Places, Products and Entertainment • Only HTML <head> accepted • http://opengraphprotocol.org/ <html xmlns:og="http://opengraphprotocol.org/schema/"> <head> <title>The Rock (1996)</title> <meta property="og:title" content="The Rock" /> <meta property="og:type" content="movie" /> <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" /> <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" /> … </head> ... - 17 -
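A consumer of this markup only needs to read the `og:` meta tags from the page head. The following is a minimal sketch using Python's standard `html.parser`; the class and function names (`OpenGraphParser`, `extract_open_graph`) and the shortened sample page are illustrative assumptions, not part of the Open Graph Protocol or of any Facebook library.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..."> tags that appear inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attr = dict(attrs)
            prop = attr.get("property") or ""
            # Only og:* properties with a content attribute are kept.
            if prop.startswith("og:") and "content" in attr:
                self.properties[prop] = attr["content"]

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def extract_open_graph(html):
    """Return a dict of og:* properties found in the document head."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Hypothetical sample page, shortened from the slide's example.
PAGE = """<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:title" content="The Rock" />
<meta property="og:type" content="movie" />
<meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
</head>
<body><meta property="og:title" content="ignored outside head" /></body>
</html>"""

print(extract_open_graph(PAGE))
```

The sketch ignores `og:` tags outside `<head>`, mirroring the "only HTML <head> accepted" restriction on the slide.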
  • 18. Example: schema.org • Agreement on a shared set of schemas for common types of web content – Bing, Google, and Yahoo! as initial supporters – Similar in intent to sitemaps.org (2006) • Use a single format to communicate the same information to all three search engines • Support for microdata • schema.org covers areas of interest to all search engines – Business listings (local), creative works (video), recipes, reviews – User defined extensions • Each search engine continues to develop its products - 18 -
  • 19. Documentation and OWL ontology - 19 -
  • 20. Current state of metadata on the Web • 31% of webpages, 5% of domains contain some metadata – Analysis of the Bing Crawl (US crawl, January, 2012) – RDFa is most common format • By URL: 25% RDFa, 7% microdata, 9% microformat • By eTLD (PLD): 4% RDFa, 0.3% microdata, 5.4% microformat – Adoption is stronger among large publishers • Especially for RDFa and microdata • See also – P. Mika, T. Potter. Metadata Statistics for a Large Web Corpus, LDOW 2012 – H. Mühleisen, C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora, LDOW 2012 - 20 -
  • 21. Exponential growth in RDFa data • Five-fold increase between March, 2009 and October, 2010 • Another five-fold increase between October, 2010 and January, 2012 • Figure: percentage of URLs with embedded metadata in various formats - 21 -
  • 23. Semantic Search: a definition • Semantic search is a retrieval paradigm that – Makes use of the structure of the data or explicit schemas to understand user intent and the meaning of content – Exploits this understanding at some part of the search process • Web search vs. vertical/enterprise/desktop search • Related fields: – XML retrieval – Keyword search in databases – Natural Language Retrieval - 23 -
  • 24. Semantics at every step of the IR process • Figure: the IR engine and the Web, with the steps query interpretation (q=“bla” * 3), crawling and indexing, ranking θ(q,d) and result presentation - 24 -
  • 25. Crawling and Indexing
  • 26. Data on the Web • Most pages on the Web are generated from structured data – Data is stored in relational databases (typically) – Queried through web forms – Presented as tables or simply as unstructured text • The structure and semantics (meaning) of the data are not directly accessible to search engines • Two solutions – Extraction using Information Extraction (IE) techniques (implicit metadata) • Supervised vs. unsupervised methods – Relying on publishers to expose structured data using standard Semantic Web formats (explicit metadata) • Particularly interesting for long tail content - 26 -
  • 27. Information Extraction methods • Natural Language Processing • Extraction of triples – Suchanek et al. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia, WWW, 2007. – Wu and Weld. Autonomously Semantifying Wikipedia, CIKM 2007. • Filling web forms automatically (form-filling) – Madhavan et al. Google's Deep-Web Crawl. VLDB 2008 • Extraction from HTML tables – Cafarella et al. WebTables: Exploring the Power of Tables on the Web. VLDB 2008 • Wrapper induction – Kushmerick et al. Wrapper Induction for Information Extraction. IJCAI 1997 - 27 -
  • 28. Semantic Web • Sharing data across the Web – Publish information in standard formats (RDF, RDFa) – Share the meaning using powerful, logic-based languages (OWL, RIF) – Query using standard languages and protocols (HTTP, SPARQL) • Two main forms of publishing – Linked Data • Data published as RDF documents linked to other RDF documents and/or using SPARQL endpoints • Community effort to re-publish large public datasets (e.g. DBpedia, open government data) – RDFa • Data embedded inside HTML pages • Recommended for site owners by Yahoo, Google, Facebook - 28 -
  • 29. Crawling the Semantic Web • Linked Data – Similar to HTML crawling, but the crawler needs to parse RDF/XML (and others) to extract URIs to be crawled – Semantic Sitemap/VoID descriptions • RDFa – Same as HTML crawling, but data is extracted after crawling – Mika et al. Investigating the Semantic Gap through Query Log Analysis, ISWC 2010. • SPARQL endpoints – Endpoints are not linked, need to be discovered by other means – Semantic Sitemap/VoID descriptions - 29 -
  • 30. Data fusion • Ontology matching – Widely studied in Semantic Web research, see e.g. list of publications at ontologymatching.org • Unfortunately, not much of it is applicable in a Web context due to the quality of ontologies • Entity resolution – Logic-based approaches in the Semantic Web – Studied as record linkage in the database literature • Machine learning based approaches, focusing on attributes – Graph-based approaches (see e.g. the work of Lise Getoor) are applicable to RDF data • Improvements over only attribute based matching • Blending – Merging objects that represent the same real world entity and reconciling information from multiple sources - 30 -
  • 31. Data quality assessment and curation • Heterogeneity, quality of data is an even larger issue – Quality ranges from well-curated data sets (e.g. Freebase) to microformats • In the worst of cases, the data becomes a graph of words – Short amounts of text: prone to mistakes in data entry or extraction • Example: mistake in a phone number or state code • Quality assessment and data curation – Quality varies from data created by experts to user-generated content – Automated data validation • Against known-good data or using triangulation • Validation against the ontology or using probabilistic models – Data validation by trained professionals or crowdsourcing • Sampling data for evaluation – Curation based on user feedback - 31 -
  • 32. Indexing • Search requires matching and ranking – Matching selects a subset of the elements to be scored • The goal of indexing is to speed up matching – Retrieval needs to be performed in milliseconds – Without an index, retrieval would require streaming through the collection • The type of index depends on the query model to support – DB-style indexing – IR-style indexing - 32 -
  • 33. IR-style indexing • Index data as text – Create virtual documents from data – One virtual document per subgraph, resource or triple • typically: resource • Key differences to Text Retrieval – RDF data is structured – Minimally, queries on property values are required - 33 -
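The idea of one virtual document per resource can be sketched as follows. This is a minimal illustration under stated assumptions: the `ex:` triples are made-up sample data, `virtual_documents` is a hypothetical helper name, and tokenization is a plain lowercase whitespace split rather than a real analyzer.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, property, literal value).
TRIPLES = [
    ("ex:Bob", "ex:name", "Bob"),
    ("ex:Bob", "ex:worksAt", "KIT"),
    ("ex:Bob", "ex:interest", "semantic search"),
    ("ex:Alice", "ex:name", "Alice"),
]


def virtual_documents(triples):
    """Group triples by subject: one virtual document per resource,
    with one text field per property."""
    docs = defaultdict(lambda: defaultdict(list))
    for subject, prop, value in triples:
        docs[subject][prop].extend(value.lower().split())
    return {subject: dict(fields) for subject, fields in docs.items()}


docs = virtual_documents(TRIPLES)
for subject, fields in docs.items():
    print(subject, fields)
```

Keeping the property as a field name is what later allows queries on property values, which plain bag-of-words indexing of the concatenated text would lose.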
  • 34. Horizontal index structure • Two fields (indices): one for terms, one for properties • For each term, store the property on the same position in the property index – Positions are required even without phrase queries • Query engine needs to support the alignment operator • ✓ Dictionary is number of unique terms + number of properties • Occurrences is number of tokens * 2 - 34 -
  • 35. Vertical index structure • One field (index) per property • Positions are not required – But useful for phrase queries • Query engine needs to support fields • Dictionary is number of unique terms • Occurrences is number of tokens • ✗ Number of fields is a problem for merging, query performance - 35 -
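The difference between the horizontal and the vertical layout can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not the index format of any particular engine: the document, the function names and the in-memory dictionaries are all hypothetical.

```python
from collections import defaultdict

# Hypothetical virtual document: property -> tokens.
DOC_ID = "ex:Bob"
FIELDS = {"name": ["bob"], "interest": ["semantic", "search"]}


def horizontal_postings(doc_id, fields):
    """Two parallel indices: the property of a token is stored at the
    same position in the property index as the token in the term index."""
    term_index, prop_index = defaultdict(list), defaultdict(list)
    pos = 0
    for prop, tokens in fields.items():
        for tok in tokens:
            term_index[tok].append((doc_id, pos))
            prop_index[prop].append((doc_id, pos))
            pos += 1
    return dict(term_index), dict(prop_index)


def horizontal_match(term_index, prop_index, term, prop):
    """Alignment operator: term and property must share a position."""
    aligned = set(term_index.get(term, [])) & set(prop_index.get(prop, []))
    return sorted({doc for doc, _pos in aligned})


def vertical_postings(doc_id, fields):
    """One index (field) per property; positions are not needed."""
    index = defaultdict(lambda: defaultdict(list))
    for prop, tokens in fields.items():
        for tok in tokens:
            index[prop][tok].append(doc_id)
    return {prop: dict(terms) for prop, terms in index.items()}


term_index, prop_index = horizontal_postings(DOC_ID, FIELDS)
vertical = vertical_postings(DOC_ID, FIELDS)
print(horizontal_match(term_index, prop_index, "search", "interest"))
print(vertical["interest"]["search"])
```

In the horizontal layout every token produces two occurrences (one per index), which matches the "number of tokens * 2" cost stated on the slide; the vertical layout stores each token once but needs one field per property.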
  • 36. Distributed indexing • MapReduce is ideal for building inverted indices – Map creates (term, {doc1}) pairs – Reduce collects all docs for the same term: (term, {doc1, doc2…}) – Sub-indices are merged separately • Term-partitioned indices • Peter Mika. Distributed Indexing for Semantic Search, SemSearch 2010. - 36 -
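The map and reduce steps can be sketched in a single process as follows. This is only a simulation of the data flow: there is no cluster, no shuffle phase and no sub-index merging, and the function names and sample documents are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in the document."""
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def reduce_phase(pairs):
    """Reduce: collect all documents for the same term into a posting list."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(docs) for term, docs in postings.items()}


# Hypothetical two-document collection.
DOCS = {"doc1": "shared apartment Berlin", "doc2": "Berlin Alice"}

index = reduce_phase(
    chain.from_iterable(map_phase(doc_id, text) for doc_id, text in DOCS.items())
)
print(index)
```

In a real MapReduce job the framework groups the emitted pairs by term before the reducer runs, which is why the resulting index is naturally term-partitioned.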
  • 38. What is search? • The search problem – A data collection consisting of a set of items (units of retrieval) – Information needs expressed as queries – Ambiguity in the interpretation of the data and/or the queries • Search is the task of efficiently finding items that are relevant to the information need – Query processing mainly focuses on efficiency of matching whereas ranking deals with degree of matching (relevance)! - 38 -
  • 39. Types of collections and query paradigms • Types of collections – Structured data with well defined schemas – Semi-structured data with incomplete or no schemas – Data that largely comprise text – Hybrid / embedded data • Units of retrieval – Subgraphs, triples, resources etc. • Query paradigms – Natural language texts and keywords – Form-based inputs – Formal structured queries - 39 -
  • 40. Types of data models (1) • Textual – Bag-of-words – Represent documents, text in structured data,…, real-world objects (captured as structured data) – Lacks “structure” • Text structure, e.g. linguistic structure, outlines, hyperlinks etc. • Structure in structured data representation • Example text: In combination with Cloud Computing technologies, promising solutions for the management of `big data' have emerged. Existing industry solutions are able to support complex queries and analytics tasks with terabytes of data. For example, …… using a Greenplum. • Example terms (statistics): combination, Cloud, Computing, Technologies, solutions, management, `big data', industry, solutions, support, complex, … - 40 -
  • 41. Types of data models (2) • Graph structure – Relationships in the data • Hyperlinks • Typed relationships – Ontology • Figure labels: creator, Picture, Person, Bob - 41 -
  • 42. Types of data models (3) • Hybrid – RDF data embedded in text (RDFa) - 42 -
  • 43. Formalisms for querying semantic data (1) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” - 43 -
  • 44. Formalisms for querying semantic data (2) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” • Unstructured – NL – Keywords: shared apartment Berlin Alice - 44 -
  • 45. Formalisms for querying semantic data (3) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” • Fully-structured – SPARQL: BGP, filter, optional, union, select, construct, ask, describe • PREFIX ns: <http://example.org/ns#> SELECT ?x WHERE { ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" } - 45 -
  • 46. Formalisms for querying semantic data (4) • Hybrid: both content and structure constraints “shared apartment Berlin Alice” ?x ns:knows ?y. ?y ns:name "Alice". ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" - 46 -
  • 47. Summary: data and queries in Semantic Search • Query: Keywords, NL Questions, Form- / facet-based Inputs, Structured Queries (SPARQL) • Ambiguities • Semantic Search targets different groups of users, information needs, and types of data. Query processing for semantic search is a hybrid combination of techniques! • Data: Structured RDF data, RDF data embedded in text (RDFa), Semi-Structured RDF data, OWL ontologies with rich, formal semantics • Ambiguities: confidence degree, truth/trust - 47 -
  • 48. Processing hybrid graph patterns (1) Example information need “Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone working at KIT.” • Keyword part: apartment shared Berlin Alice • Structured part: ?y ns:name "Alice". ?x ns:knows ?y. ?x ns:knows ?z. ?z ns:works ?v. ?v ns:name "KIT" • Example text in the data: “Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:” • Figure: data graph with nodes including Alice, Bob, Peter, Thanh, KIT, FluidOps, Germany, Semantic Search and sunset.jpg (title “Beautiful Sunset”), and edges such as knows, works, author, creator, age and year - 48 -
  • 49. Matching keyword query against text • Retrieve documents • Inverted list (inverted index) – keyword → {<doc1, pos, score, ...>, <doc2, pos, score, ...>, ...} • AND-semantics: top-k join • [Figure: the posting lists for “shared”, “Berlin” and “Alice” are joined on the document id; document D1 appears in all three lists] - 49 -
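The inverted list with AND-semantics described on this slide can be sketched as follows. This is a minimal illustration, not the system behind the slides: the three sample documents, the function names and the scoring rule (total number of keyword occurrences) are assumptions, and the intersection is computed in full rather than with a true top-k join that stops early.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def and_query(index, keywords, k=10):
    """Return up to k doc ids that contain every keyword (AND-semantics).

    Assumed scoring: total number of keyword occurrences in the document.
    """
    postings = [index.get(kw.lower(), {}) for kw in keywords]
    if not postings:
        return []
    # Start from the shortest posting list to keep the candidate set small.
    postings.sort(key=len)
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates.intersection_update(posting)
    scored = [(sum(len(p[d]) for p in postings), d) for d in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical sample documents.
docs = {
    "D1": "we shared an apartment in berlin with alice",
    "D2": "alice moved to berlin",
    "D3": "bob shared photos",
}
index = build_inverted_index(docs)
result = and_query(index, ["shared", "Berlin", "Alice"])
print(result)
```

Only D1 contains all three keywords, so it is the only document returned for the example query.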
  • 50. Matching structured query against structured data • Retrieve data for triple patterns • Index on tables • Multiple “redundant” indexes to cover different access patterns • Join (conjunction of triples) • Blocking, e.g. linear merge join (requires sorted input) • Non-blocking, e.g. symmetric hash-join • Materialized join indexes • [Figure: for the query ?x ns:knows ?y . ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT", the pattern ?z ns:works ?v is answered from an SP-index (Per1 ns:works Ins1) and the pattern ?v ns:name "KIT" from a PO-index (Ins1 ns:name KIT); the two results are joined on Ins1] - 50 -
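To make the difference between blocking and non-blocking joins concrete, here is a small sketch of a symmetric hash join over two streams of triples. The triples, identifiers such as Per2 and Ins2, and the round-robin arrival order are assumptions made for illustration; a real engine would interleave inputs as they arrive from the sources.

```python
from collections import defaultdict
from itertools import zip_longest


def symmetric_hash_join(left, right, left_key, right_key):
    """Yield (l, r) pairs as soon as both matching tuples have arrived.

    Each input keeps its own hash table. An arriving tuple is inserted into
    its table and immediately probed against the other table, so results are
    reported early and neither input has to be consumed completely first.
    """
    left_table, right_table = defaultdict(list), defaultdict(list)
    done = object()  # sentinel marking an exhausted input
    for l, r in zip_longest(left, right, fillvalue=done):
        if l is not done:
            key = left_key(l)
            left_table[key].append(l)
            for match in right_table[key]:
                yield (l, match)
        if r is not done:
            key = right_key(r)
            right_table[key].append(r)
            for match in left_table[key]:
                yield (match, r)


# Hypothetical matches for "?z ns:works ?v" and "?v ns:name ?n".
works = [("Per1", "ns:works", "Ins1"), ("Per2", "ns:works", "Ins2")]
names = [("Ins2", "ns:name", "MIT"), ("Ins1", "ns:name", "KIT")]

results = list(symmetric_hash_join(
    works, names,
    left_key=lambda t: t[2],   # join on the object of ns:works
    right_key=lambda t: t[0],  # ... and the subject of ns:name
))
persons_at_kit = [l[0] for l, r in results if r[2] == "KIT"]
print(persons_at_kit)
```

A merge join would instead require both inputs to be sorted on the join key, which is why the slide classifies it as blocking.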
  • 51. Matching keyword query against structured data • Retrieve keyword elements • Using inverted index – keyword → {<el1, score, ...>, <el2, score, ...>, …} • Exploration / “Join” • Data indexes for triple lookup • Materialized index (paths up to graphs) • Top-k Steiner tree search, top-k subgraph exploration • [Figure: the keywords Alice, Bob and KIT are matched to data elements and connected through the triples Alice ns:knows Bob, Bob ns:works Inst1 and Inst1 ns:name KIT] - 51 -
  • 52. Matching structured query against text • Offline IE • Online IE, i.e., “retrieve” works as follows • Derive keywords to retrieve relevant documents • On-the-fly information extraction, i.e., phrase pattern matching “X name Y” • Retrieve extracted data for structured part • Retrieve documents for derived text patterns, e.g. sequence, windows, reg. exp. • [Figure: the query ?x ns:knows ?y . ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT" and the keywords derived from it: knows, name, KIT] - 52 -
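On-the-fly extraction with a phrase pattern can be sketched with a regular expression. The pattern “X works at Y”, the sample sentence and the mapping to the predicate ns:works are assumptions chosen to match the running example; real extractors use far richer patterns.

```python
import re

# Assumed phrase pattern: a capitalized token, the phrase "works at",
# and another capitalized token.
WORKS_AT = re.compile(r"\b([A-Z]\w*) works at ([A-Z]\w*)")


def extract_triples(text):
    """Turn every match of the phrase pattern into an (s, p, o) triple."""
    return [(x, "ns:works", y) for x, y in WORKS_AT.findall(text)]


triples = extract_triples("Thanh works at KIT. Peter works at Yahoo.")
print(triples)
```

The extracted triples can then be joined with the structured part of the query in the same way as triples retrieved from a data index.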
  • 53. Matching structured query against text • Index • Inverted index for document retrieval and pattern matching • Join index: inverted index for storing materialized joins between keywords • Neighborhood indexes for phrase patterns • [Figure: the structured query of the previous slide with the derived keywords knows, name, KIT] - 53 -
  • 54. Query processing – main tasks • Retrieval – Documents, data elements, triples, paths, graphs – Inverted index,…, but also other (B+ tree) – Index documents, triples, materialized paths • Join – Different join implementations, efficiency depends on availability of indexes – Non-blocking join good for early result reporting and for “unpredictable” Linked Data / data streams scenario - 54 -
  • 55. Query processing – more tasks • More complex queries: disjunction, aggregation, grouping, analytics… • Join order optimization • Approximate – Approximate the search space – Approximate the results (matching, join) • Parallelization • Top-k – Use only some entries in the input streams to produce k results • Multiple sources – Federation, routing – On-the-fly mapping, similarity join • Hybrid – Join text and data - 55 -
  • 56. Query processing on the Web - research challenges and opportunities • Challenges – Large amount of semantic data – Data inconsistent, redundant, and low quality – Large amount of data embedded in text – Large amount of sources – Large amount of links between sources • Opportunities – Optimization, parallelization – Approximation – Hybrid querying and data management – Federation, routing – Online schema mappings – Similarity join - 56 -
  • 58. Ranking – problem definition • Ambiguities arise when representation is incomplete / imprecise • Ambiguities at the level of • elements (content ambiguity) • structure between elements (structure ambiguity) • Due to ambiguities in the representation of the information needs and the underlying resources, the results cannot be guaranteed to exactly match the query. Ranking is the problem of determining the degree of matching using some notions of relevance. - 58 -
  • 59. Content ambiguity • Hybrid query: keywords “apartment shared Berlin Alice” plus triple patterns ?x ns:knows ?y . ?y ns:name "Alice" . ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT" • [Figure: the data graph and text of slide 48] • What is meant by “Berlin” in the query? What is meant by “Berlin” in the data? A city with the name Berlin? A person? • What is meant by “KIT” in the query? What is meant by “KIT” in the data? A research group? A university? A location? - 59 -
  • 60. Structure ambiguity • Hybrid query: keywords “apartment shared Berlin Alice” plus triple patterns ?x ns:knows ?y . ?y ns:name "Alice" . ?x ns:knows ?z . ?z ns:works ?v . ?v ns:name "KIT" • [Figure: the data graph and text of slide 48] • What is the connection between “Berlin” and “Alice”? Friend? Co-worker? • What is meant by “works”? Works at? Employed? - 60 -
  • 61. Ambiguity • Ambiguities arise when data or query allow for multiple interpretations, i.e. multiple matches – Syntactic, e.g. works vs. works at – Semantic, e.g. works vs. employ • “Aboutness”, i.e., containing some elements which represent the correct interpretation – Ambiguities arise when matching elements of different granularities – Does i contain the interpretation for j, given that some part(s) of i (syntactically/semantically) match j? – E.g. Berlin vs. “…we went to the same university, and also, we shared an apartment in Berlin in 2008…” • Strictly speaking, ranking is performed after syntactic / semantic matching is done! - 61 -
  • 62. Features: What to use to deal with ambiguities? What is meant by “Berlin”? What is the connection between “Berlin” and “Alice”? • Content features – Frequencies of terms: d is more likely to be “about” a query term k the more often d mentions k (probabilistic IR) – Co-occurrences: terms K that often co-occur form a contextual interpretation, i.e., topics (cluster hypothesis) • Structure features – Consider relevance at level of fields – Link-based popularity - 62 -
  • 63. Ranking paradigms • Explicit relevance model – Foundation: probability ranking principle – Ranking results by the posterior probability (odds) of being observed in the relevant class: $P(D|R) / P(D|N)$, with $P(D|R) = \prod_{w \in D} P(w|R) \prod_{w \notin D} (1 - P(w|R))$ – P(w|R) varies in different approaches • binary independence model • Two-Poisson model • BM25 - 63 -
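A worked example may help with the odds-based ranking above. The sketch below scores a document by the log of P(D|R)/P(D|N) under the binary independence assumption. The two-term vocabulary and the probabilities P(w|R) and P(w|N) are invented for illustration; in practice they have to be estimated, which is exactly where the listed models differ.

```python
import math


def bim_log_odds(doc_terms, p_rel, p_non):
    """log( P(D|R) / P(D|N) ) under the binary independence assumption.

    p_rel[w] is an assumed P(w|R) and p_non[w] an assumed P(w|N); the
    vocabulary is taken to be the keys of p_rel.
    """
    score = 0.0
    for w in p_rel:
        if w in doc_terms:
            score += math.log(p_rel[w] / p_non[w])
        else:
            score += math.log((1 - p_rel[w]) / (1 - p_non[w]))
    return score


# Hypothetical term probabilities for the relevant and non-relevant class.
p_rel = {"berlin": 0.8, "apartment": 0.6}
p_non = {"berlin": 0.2, "apartment": 0.3}

score_a = bim_log_odds({"berlin", "apartment"}, p_rel, p_non)
score_b = bim_log_odds({"apartment"}, p_rel, p_non)
print(score_a, score_b)
```

The document mentioning both terms gets positive log-odds, while the one missing “berlin” is penalized for the absent term and ends up below zero.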
  • 64. Ranking paradigms • No explicit notion of relevance: similarity between the query and the document model – Vector space model (cosine similarity): $Sim(q, d) = \cos((w_{1,d}, \ldots, w_{t,d}), (w_{1,q}, \ldots, w_{t,q}))$ – Language models (KL divergence): $Sim(q, d) = -KL(\theta_q \| \theta_d) = -\sum_{t \in V} P(t|\theta_q) \log \frac{P(t|\theta_q)}{P(t|\theta_d)}$ - 64 -
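Both similarity functions on this slide are easy to state in code. The sketch assumes that the query and document vectors have the same length and that the document model is smoothed, so that it assigns a non-zero probability to every term of the query model; the example distributions are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neg_kl(theta_q, theta_d):
    """-KL(theta_q || theta_d) for term distributions given as dicts.

    Assumes theta_d[t] > 0 wherever theta_q[t] > 0 (e.g. after smoothing).
    """
    return -sum(p * math.log(p / theta_d[t])
                for t, p in theta_q.items() if p > 0)


# Hypothetical query and document language models.
theta_q = {"berlin": 0.5, "alice": 0.5}
theta_d = {"berlin": 0.9, "alice": 0.1}
print(cosine([1, 2], [2, 4]), neg_kl(theta_q, theta_d))
```

Identical models give a negated divergence of zero, the best possible score; any mismatch pushes the score below zero.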
  • 65. Model construction • How to obtain – Relevance models? – Weights for query / document terms? – Language models for document / queries? - 65 -
  • 66. Content-based features • Document statistics, e.g. – Term frequency – Document length • Collection statistics, e.g. – Inverse document frequency – Background language models • An object is more likely about “Berlin” when • it contains a relatively high number of mentions of the term “Berlin” • the number of mentions of this term in the overall collection is relatively low • $w_{t,d} = \frac{tf}{|d|} \cdot idf$ • $P(t|\theta_d) = \lambda \frac{tf}{|d|} + (1 - \lambda) P(t|C)$ - 66 -
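The two formulas on this slide can be tried out on a toy collection. The three documents are invented, and since the slide does not define idf, the sketch assumes the common choice idf = log(N / df); λ = 0.5 is likewise an arbitrary value.

```python
import math
from collections import Counter

# Hypothetical toy collection, already tokenized.
DOCS = {
    "d1": "berlin berlin apartment".split(),
    "d2": "alice berlin".split(),
    "d3": "alice bob".split(),
}
COLLECTION = Counter(t for tokens in DOCS.values() for t in tokens)
COLLECTION_LEN = sum(COLLECTION.values())


def tf_idf(term, doc_id):
    """w_{t,d} = tf / |d| * idf, with idf assumed to be log(N / df)."""
    tokens = DOCS[doc_id]
    df = sum(1 for d in DOCS.values() if term in d)
    if df == 0:
        return 0.0
    return tokens.count(term) / len(tokens) * math.log(len(DOCS) / df)


def smoothed_lm(term, doc_id, lam=0.5):
    """P(t|theta_d) = lam * tf/|d| + (1 - lam) * P(t|C)."""
    tokens = DOCS[doc_id]
    p_doc = tokens.count(term) / len(tokens)
    p_coll = COLLECTION[term] / COLLECTION_LEN
    return lam * p_doc + (1 - lam) * p_coll


print(tf_idf("berlin", "d1"), smoothed_lm("berlin", "d1"))
print(smoothed_lm("bob", "d1"))  # non-zero thanks to the background model
```

The background model P(t|C) is what keeps unseen terms such as “bob” in d1 from receiving zero probability.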
  • 67. Structure-based features • Consider structure of objects – Content-based features for structured objects, documents and for general tuples: $P(t|\theta_d) = \sum_{f \in F_d} \alpha_f P(t|\theta_f)$ • An object is more likely about “Berlin” when • one of its (important) fields contains a relatively high number of mentions of the term “Berlin” - 67 -
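The field mixture can be illustrated with a two-field object. The field names, their contents and the weights α_f are assumptions; each field model P(t|θ_f) is taken to be the unsmoothed relative frequency of the term within that field.

```python
def field_mixture_lm(term, fields, weights):
    """P(t|theta_d) = sum over fields f of alpha_f * P(t|theta_f).

    `weights` holds the alpha_f values, which are assumed to sum to 1.
    """
    total = 0.0
    for name, text in fields.items():
        tokens = text.lower().split()
        if tokens:
            total += weights[name] * tokens.count(term) / len(tokens)
    return total


# Hypothetical object with a short title and a longer body.
fields = {
    "title": "berlin sunset",
    "body": "we shared an apartment in berlin",
}
weights = {"title": 0.7, "body": 0.3}

p_berlin = field_mixture_lm("berlin", fields, weights)
print(p_berlin)
```

With these weights, a mention in the short, heavily weighted title contributes 0.35 to the probability, against 0.05 from the body.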
  • 68. Structure-based features (2) • PageRank – Link analysis algorithm – Measuring relative importance of nodes – Link counts as a vote of support – The PageRank of a node recursively depends on the number and PageRank of all nodes that link to it (incoming links) • ObjectRank – Types and semantics of links vary in structured data setting – Authority transfer schema graph specifies connection strengths – Recursively compute authority transfer data graph An object about “Berlin” is more important than another when • a relatively large number of objects are linked to it - 68 -
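The recursive definition of PageRank is usually computed by power iteration, sketched below. The four-node link graph, the damping factor of 0.85 and the even redistribution of rank from nodes without outgoing links are assumptions. ObjectRank would additionally weight each edge by an authority transfer rate, which this sketch does not model.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict {node: [nodes it links to]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank


# Hypothetical link graph in which three objects point to "Berlin".
links = {"A": ["Berlin"], "B": ["Berlin"], "C": ["Berlin", "A"], "Berlin": []}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

Because static scores like these do not depend on the query, they can be computed once offline.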
  • 69. In practice • Many more aspects of relevance – User profiles – History – Context, e.g. geo-location – etc. • Combination of features using Machine Learning – Several hundred features in modern search engines • Pre-compute static features such as PageRank/ObjectRank • Two-phase scoring for efficiency – Round 1: easy to compute features – Round 2: more expensive features - 69 -
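Two-phase scoring can be sketched in a few lines. The function name, the shortlist size and the toy scoring functions are assumptions; the point is only that the expensive features are computed for the shortlist rather than for every candidate.

```python
import heapq


def two_phase_rank(candidates, cheap_score, expensive_score,
                   shortlist=100, k=10):
    """Round 1 keeps the best `shortlist` candidates by the cheap score;
    round 2 re-ranks only those with the expensive score."""
    first_round = heapq.nlargest(shortlist, candidates, key=cheap_score)
    return heapq.nlargest(k, first_round, key=expensive_score)


# Toy example: ten candidate ids with invented scoring functions.
top = two_phase_rank(
    range(10),
    cheap_score=lambda x: x,
    expensive_score=lambda x: -abs(x - 6.6),
    shortlist=5,
    k=2,
)
print(top)
```

The trade-off is that a candidate dropped in round 1 can never be recovered, so the cheap score has to correlate reasonably well with the final one.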
  • 70. Evaluation Harry Halpin, Daniel Herzig, Peter Mika, Jeff Pound, Henry Thompson, Roi Blanco, Thanh Tran Duc
  • 71. Semantic Search challenge (2010/2011) • Two tasks – Entity Search • Queries where the user is looking for a single real world object • Pound et al. Ad-hoc Object Retrieval in the Web of Data, WWW 2010. – List search (new in 2011) • Queries where the user is looking for a class of objects • Billion Triples Challenge 2009 dataset • Evaluated using Amazon’s Mechanical Turk – Halpin et al. Evaluating Ad-Hoc Object Retrieval, IWEST 2010 – Blanco et al. Repeatable and Reliable Search System Evaluation using Crowd-Sourcing, SIGIR 2011 - 71 -
  • 72. Evaluation form - 72 -
  • 73. Other evaluations • TREC Entity Track – Related Entity Finding • Entities related to a given entity through a particular relationship • Retrieval over documents (ClueWeb09 collection) • Example: (Homepages of) airlines that fly Boeing 747 – Entity List Completion • Given some elements of a list of entities, complete the list • Question Answering over Linked Data – Retrieval over specific datasets (DBpedia and MusicBrainz) – Full natural language questions of different forms – Correct results defined by an equivalent SPARQL query – Example: Give me all actors starring in Batman Begins. - 73 -
  • 75. Search interface • Input and output functionality – helping the user to formulate complex queries – presenting the results in an intelligent manner • Semantic Search brings improvements in – Query formulation – Snippet generation – Suggesting related entities – Adaptive and interactive presentation • Presentation adapts to the kind of query and results presented • Object results can be actionable, e.g. buy this product – Aggregated search • Grouping similar items, summarizing results in various ways • Filtering (facets), possibly across different dimensions – Task completion • Help the user to fulfill the task by placing the query in a task context - 75 -
  • 76. Query formulation • “Snap-to-grid”: suggest the most likely interpretation of the query – Given the ontology or a summary of the data – While the user is typing or after issuing the query – Example: Freebase suggest, TrueKnowledge - 76 -
  • 77. Enhanced results/Rich Snippets – Use mark-up from the webpage to generate search snippets • Originally invented at Yahoo! (SearchMonkey) – Google, Yahoo!, Bing, Yandex now consume schema.org markup • Validators available from Google and Bing - 77 -
  • 78. Other result presentation tasks • Select the most relevant resources within an RDF document – Penin et al. Snippet Generation for Semantic Web Search Engines, ASWC 2010 • For each resource, rank the properties to be displayed • Natural Language Generation (NLG) – Verbalize, explain results - 78 -
  • 81. Related entities Related actors and movies - 81 -
  • 83. Resources • Books – Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. ACM Press. 2011 • Survey papers – Thanh Tran, Peter Mika. Survey of Semantic Search Approaches. Under submission, 2012. • Conferences and workshops – ISWC, ESWC, WWW, SIGIR, CIKM, SemTech – Semantic Search workshop series – Exploiting Semantic Annotations in Information Retrieval (ESAIR) – Entity-oriented Search (EOS) workshop • Upcoming – Joint Intl. Workshop on Entity-oriented and Semantic Search (JIWES) at SIGIR 2012 – ESAIR 2012 at CIKM 2012 - 83 -
  • 84. The End • Many thanks to Thanh Tran (KIT) and members of the SemSearch group at Yahoo! Research in Barcelona • Contact – pmika@yahoo-inc.com – Internships available for PhD students (deadline in January) - 84 -

Editor's Notes

  1. In fact, some of these searches are so hard that the users don’t even try them anymore
  2. With ads, the situation is even worse due to the sparsity problem. Note how poor the ads are…
  3. Search is a form of content aggregation
  4. Facebook invited, but continues to pursue OGP
  5. Semantic search can be seen as a retrieval paradigm that is centered on the use of semantics. If a system incorporates the semantics entailed by the query and (or) the resources into the matching process, it essentially performs semantic search.
  6. Barcelona
  7. Close to the topic of keyword search in databases, except that knowledge bases have a schema-oblivious design. Different papers assume vastly different query needs even on the same type of data.
  8. Misses structural information in texts: hyperlinks, linguistic structure, positional information.
  9. - Real world objects
  10. SELECT: Returns all, or a subset of, the variables bound in a query pattern match. CONSTRUCT: Returns an RDF graph constructed by substituting variables in a set of triple templates. ASK: Returns a boolean indicating whether a query pattern matches or not. DESCRIBE: Returns an RDF graph that describes the resources found. Graph patterns are defined recursively. A graph pattern may have zero or more optional graph patterns, and any part of a query pattern may have an optional part. In this example, there are two optional graph patterns. Section 6 introduces the ability to make portions of a query optional; Section 7 introduces the ability to express the disjunction of alternative graph patterns; and Section 8 introduces the ability to constrain portions of a query to particular source graphs. Section 8 also presents SPARQL's mechanism for defining the source graphs for a query. Basic graph patterns allow applications to make queries where the entire query pattern must match for there to be a solution. For every solution of a query containing only group graph patterns with at least one basic graph pattern, every variable is bound to an RDF Term in a solution. However, regular, complete structures cannot be assumed in all RDF graphs. It is useful to be able to have queries that allow information to be added to the solution where the information is available, but do not reject the solution because some part of the query pattern does not match. Optional matching provides this facility: if the optional part does not match, it creates no bindings but does not eliminate the solution. The UNION pattern combines graph patterns; each alternative possibility can contain more than one triple pattern. SPARQL provides a means of combining graph patterns so that one of several alternative graph patterns may match. If more than one of the alternatives matches, all the possible pattern solutions are found.
  11. Web data: text + Linked Data + semi-structured RDF + hybrid data that can be conceived as forming data graphs. We hear about Bob and Alice all the time (in the computer science literature) and want to find out more, so we build a Semantic Web search engine. To address complex information needs by exploiting Web data: - Treat the query as a set of constraints - Match structured data - Match text
  12. - Less than 5 percent of IR papers deal with query processing and the aspect of efficiency
  13. Partitioning has an impact on performance! Blocking: iterator-based approaches. Non-blocking: good for streaming, and good when we cannot wait for some parts of the results to be completely processed. Linked Data: we cannot wait for sources (some are slower than others), so it is better to push data into query processing as it arrives instead of pulling data and waiting (busy waiting). Top-k:
  14. - Phrase patterns (e.g., “X is the capital of Y”) for large-scale extraction. Such simple patterns, when coupled with the richness and redundancy of the Web, can be very useful in scraping millions or even billions of facts from the Web. - Patterns: a sequence pattern is matched when keywords or data types Xi appear in sequence; a window pattern is matched if all keywords/data types/patterns appear within an m-word window. For extraction: relation patterns. For text search: entity patterns. - When we cannot assume that all relevant data can be extracted, such matching against text is still needed: hybrid search.
  15. - Phrase patterns (e.g., “X is the capital of Y”) for large-scale extraction. Such simple patterns, when coupled with the richness and redundancy of the Web, can be very useful in scraping millions or even billions of facts from the Web. - Patterns: a sequence pattern is matched when keywords or data types Xi appear in sequence; a window pattern is matched if all keywords/data types/patterns appear within an m-word window. For extraction: relation patterns. For text search: entity patterns. - When we cannot assume that all relevant data can be extracted, such matching against text is still needed: hybrid search.
  16. Given some materialized indexes → no joins at all. Given sorted inputs → sorted merge join.
  17. Given some materialized indexes → no joins at all. Given sorted inputs → sorted merge join.
  18. Every task is a challenge in itself, some more and some less well elaborated. There are separate challenges for every problem.
  19. Approximate → many results → need ranking. Ranking is also needed in the case where query processing is complete and sound, but the queries and the data representation are so imprecise that we have to deal with too many results.
  20. Web data: text + Linked Data + semi-structured RDF + hybrid data that can be conceived as forming data graphs. We hear about Bob and Alice all the time (in the computer science literature) and want to find out more, so we build a Semantic Web search engine. To address complex information needs by exploiting Web data: - Treat the query as a set of constraints - Match structured data - Match text
  21. Syntactic: works vs. works at. Semantic: works vs. employ.
  22. Aboutness (semantics are employed at matching, not ranking!)
  23. Text which contains a large number of mentions of “Berlin” is likely to be about “Berlin”. i is more likely to be the correct interpretation of K when the terms in K co-occur in a large number of contexts (bags of words) associated with i. “Berlin” and “apartment” co-occur more often in the geographic location context/topic than in the context of people.
  24. - Underlying most research on probabilistic models of Information Retrieval is the probability ranking principle (PRP). - Documents are ranked by the posterior probability that they belong to the relevant class. Robertson [19] also shows that it is equivalent to rank the documents by the odds of their being observed in the relevant class, P(D|R)/P(D|N). - Here R represents the class of documents relevant to the user’s query, and N is the class of non-relevant documents. - The estimation of P(w|R) differs in various models. The Binary Independence Model [17, 23] treats each document as a binary vector over the vocabulary space, ignoring word frequencies. - The 2-Poisson Model [18] goes a step further in modeling term frequencies in documents according to a mixture of two Poisson distributions. - Here P(w|R) are the probabilities of the word w being present in a document sampled from the relevant class. These probabilities are estimated using heuristic techniques in the absence of relevance information. - Let’s assume that the query words and the words in relevant documents are sampled identically and independently from a unigram distribution (there is a finite universe of unigram distributions from which we could sample).
  25. In the multinomial unigram model, each document D is represented as a multinomial probability distribution P(t|θ_D) over all the terms t in the vocabulary. A more general and flexible retrieval model can be obtained by using a comparison of two language models as the basis for ranking. Several authors proposed the use of the Kullback–Leibler (KL) divergence for ranking, since it is a well-established measure for the comparison of probability distributions with some intuitive properties: it always has a non-negative value, and equal distributions receive a zero divergence value (Lafferty et al., 2001; Ng, 2001; Xu & Croft, 1999). Using KL divergence, documents are scored by measuring the divergence between a query model θ_Q and each document model θ_D. Since we want to assign a high score for high similarity and a low score for low similarity, the KL divergence is negated for ranking purposes. Here V denotes the set of all terms used in all documents in the collection. KL divergence is also known as the relative entropy, which is defined as the cross-entropy of the observed distribution (in this case the query) as if it was generated by a reference distribution (in this case the document) minus the entropy of the observed distribution.
  26. Here, n(v;D) is the count of the word v in the document, |d| is the document length, and P(v|C) is the background probability of v.
  27. Probability of the term being generated by the overall collection.
  28. Here P(w|O_jk) is the probability of generating w by the j-th field of record k. P(w|O_jk) can be computed by treating each O_jk as a document. The generative story is that the document needs to generate the f1 field models, and then each f1 model will generate the query term q1 and the structural term #combine[f2](q2). Next, each f2 field model in each f1 model will generate the query term q2.
  29. From the schema graph G(V_G, E_G), we create the authority transfer schema graph G^A(V_G, E^A) to reflect the authority flow through the edges of the graph. This may be either a trial-and-error process, until we are satisfied with the quality of the results, or a domain expert’s task. In particular, for each edge e_G = (u → v) of E_G, two authority transfer edges, e_G^f = (u → v) and e_G^b = (v → u), are created. The two edges carry the label of the schema graph edge and, in addition, each one is annotated with a (potentially different) authority transfer rate, α(e_G^f) and α(e_G^b) respectively. The motivation for defining two edges for each edge of the schema graph is that authority potentially flows in both directions and not only in the direction that appears in the schema. For example, a paper passes its authority to its authors and vice versa. Notice, however, that the authority flow in each direction (defined by the authority transfer rate) may not be the same. For example, a paper that is cited by important papers is clearly important, but citing important papers does not make a paper important. Given a data graph D(V_D, E_D) that conforms to an authority transfer schema graph G^A(V_G, E^A), ObjectRank derives an authority transfer data graph D^A(V_D, E_D^A) (Figure 5) as follows