Semantic Web Search
         Searching Documents and Semantic Data on the Web
         Presentation at Information Sciences Institute, USC
Semantic Search Group at the AIFB Institute
Thanh Tran, Günter Ladwig, Daniel M. Herzig, Andreas Wagner,
Veli Bicer, Yongtao Ma and Rudi Studer.

http://sites.google.com/site/kimducthanh




    KIT – die Kooperation von Forschungszentrum Karlsruhe GmbH und Universität Karlsruhe (TH)
Structure


         • Motivation
         • Previous and current work
         • Keyword query processing
         • Keyword query result ranking
         • Conclusion




Besides documents, there is an increasing amount of structured data on
          the Web such as RDF, RDFa and Linked Data! How can we leverage this
          for enhancing the search experience?

          MOTIVATION


RDFa
     …
     <div about="/alice/posts/trouble_with_bob">
         <h2 property="dc:title">The trouble with Bob</h2>
         <h3 property="dc:creator">Alice</h3>

                             Bob is a good friend of mine. We went to the same university, and
                             also shared an apartment in Berlin in 2008. The trouble with Bob is
                             that he takes much better photos than I do:

         <div about="http://example.com/bob/photos/sunset.jpg">
          <img src="http://example.com/bob/photos/sunset.jpg" />
          <span property="dc:title">Beautiful Sunset</span>
          by <span property="dc:creator">Bob</span>.
         </div>
     </div>
     …
     adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
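
As a rough illustration of how the markup above turns into structured data, the following Python sketch reads only the `about` and `property` attributes and emits (subject, predicate, literal) triples. It is not a conforming RDFa processor: the names `SimpleRDFaParser` and `extract_triples` are made up for this example, prefixes such as `dc:` are left unexpanded, and the input is assumed to be well-formed XHTML in which every element is closed.

```python
from html.parser import HTMLParser


class SimpleRDFaParser(HTMLParser):
    """Toy extractor: handles only `about` and `property` (an assumption)."""

    def __init__(self):
        super().__init__()
        self.about_stack = []      # one entry per open element: its `about` or None
        self.open_property = None  # (subject, predicate, depth) while reading a literal
        self.buffer = []
        self.triples = []

    def current_subject(self):
        # The nearest enclosing element with an `about` attribute sets the subject.
        for about in reversed(self.about_stack):
            if about is not None:
                return about
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        self.about_stack.append(attrs.get("about"))
        if "property" in attrs and self.open_property is None:
            depth = len(self.about_stack)
            self.open_property = (self.current_subject(), attrs["property"], depth)
            self.buffer = []

    def handle_endtag(self, tag):
        if self.open_property and self.open_property[2] == len(self.about_stack):
            subject, predicate, _ = self.open_property
            self.triples.append((subject, predicate, "".join(self.buffer).strip()))
            self.open_property = None
        if self.about_stack:
            self.about_stack.pop()

    def handle_data(self, data):
        if self.open_property:
            self.buffer.append(data)


def extract_triples(markup):
    parser = SimpleRDFaParser()
    parser.feed(markup)
    parser.close()
    return parser.triples


SNIPPET = """
<div about="/alice/posts/trouble_with_bob">
  <h2 property="dc:title">The trouble with Bob</h2>
  <h3 property="dc:creator">Alice</h3>
  <div about="http://example.com/bob/photos/sunset.jpg">
    <img src="http://example.com/bob/photos/sunset.jpg" />
    <span property="dc:title">Beautiful Sunset</span>
    by <span property="dc:creator">Bob</span>.
  </div>
</div>
"""

for triple in extract_triples(SNIPPET):
    print(triple)
```

The post and the photo each become a subject with a title and a creator, which is the kind of data a semantic search engine can index next to the text.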



RDFa

Bob is a good friend of mine. We went to the same university, and
also shared an apartment in Berlin in 2008. The trouble with Bob is
that he takes much better photos than I do:

[Figure: the rendered post; two regions are labeled "content"]




     adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
Semantic Data




                                                                                                source: http://linkeddata.org/
Linked Data




     adapted from: http://www.w3.org/TR/xhtml-rdfa-primer/
Addressing Complex Information Needs
     “Information about a friend of Alice, who shared an apartment with
      her in Berlin and knows someone in the field of Semantic Search
      working at KIT”.


     [Figure: the post "trouble with bob" (Alice's text about Bob) shown alongside graph nodes labeled Alice, Bob, sunset.jpg, Beautiful Sunset, Thanh, Peter, KIT, FluidOps, Semantic Search, Germany, 2009 and 34. Three parts of the information need are marked on it: <friend of Alice>, <shared apartment in Berlin with Alice> and <knows someone in the field of Semantic Search working at KIT>.]
Data Sources in SemanticSearch@AIFB Demo

      English Wikipedia

      Data from Linked Open Data
                 DBpedia
                 YAGO
                 Many more


      Live data from Data.gov (US Government)
                 E.g. live data about earthquakes


Search Intent Interpretation, Refinement and Exploration

     [Figure: search interface with the labeled parts Keywords, Query Completions, Term Completions and Facets]
Result Inspection, Analysis and Browsing




OVERVIEW OF WORK


Search Concepts
      Hybrid Search: Structured queries combined with
       keywords on structured and unstructured data in
       possibly remote (Linked Data) sources
                                                                                                 BACK-END


      Query interpretation: Translation of keywords to
       hybrid queries

      Keyword search (translated hybrid query)
       combined with faceted search: starting with
       keywords and then iterative refinement process
       based on operations on facets
                                                                                                 FRONT-END

Previous and Current Work

      Semi-structured RDF data management [ISWC09] [TKDE12]
                 Inverted index for RDF data management
                 Structure index
      Linked data management [ESWC10][ISWC10] [ESWC11][ISWC11]
                 Keyword query routing to find relevant sources / relevant
                  combination of sources
                 “Explorative” query processing and adaptive query optimization
                 Combining local and remote Linked Data
      Search frontends [ICDE09][CIKM11] [SIGIR11][ISWC2011] [Dexa11]
                 Ontology and entity result summarization
                 Faceted and keyword search
      Current work: hybrid data search

KEYWORD QUERY PROCESSING
           [ICDE09]
DB-style Keyword Search
     Keyword query processing / translation
“Articles of researchers at Stanford with Turing Award”                                          „Stanford      Article   Turing Award“

                                                                               Specification




                     Keywords might produce large number of
                      matching elements in the data graph
                     The data graph might be large in size
                     Search complexity increases substantially with
                      the size of the graph
                     Large number of results

     Selection                             Set of Queries                                                     Set of Results
                                 1) Query 1                                                             1) Result 1
                                 2) Query 2                                                             2) Result 2

Query Space
     [Figure: schema graph (left) and the query space derived from it (right)]

     Main Idea
      Exploration on a much reduced summary model of the data graph
                 Query space: a more compact representation of the data graph
                 Substantially decreases complexity
      Online construction of the query space out of the schema graph
                 Match keywords against labels of resources to find keyword elements
                 Connect keyword elements with elements of the schema to obtain the query space
      Online top-k query graph exploration
                 Top-k procedure for graph exploration to compute only the top-k results
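
The construction step can be pictured with a small Python sketch. It is a simplification under stated assumptions, not the ICDE09 implementation: the schema graph is an undirected edge list, keywords are matched only against whole tokens of resource labels, and each matched resource is attached to its class. The function name `build_query_space` and all data in the example are hypothetical.

```python
def build_query_space(schema_edges, resource_labels, resource_class, keywords):
    """Return (query_space, keyword_elements).

    schema_edges    : list of (class_a, class_b) pairs from the schema graph
    resource_labels : dict resource id -> label text
    resource_class  : dict resource id -> class the resource belongs to
    keywords        : list of query keywords
    """
    space = {}

    def connect(a, b):
        space.setdefault(a, set()).add(b)
        space.setdefault(b, set()).add(a)

    # Start from the schema graph, which is much smaller than the data graph.
    for a, b in schema_edges:
        connect(a, b)

    # Match keywords against labels to find keyword elements, then connect
    # each keyword element to its schema element.
    keyword_elements = {}
    for keyword in keywords:
        matches = [
            resource
            for resource, label in resource_labels.items()
            if keyword.lower() in label.lower().split()
        ]
        keyword_elements[keyword] = matches
        for resource in matches:
            connect(resource, resource_class[resource])
    return space, keyword_elements


schema = [("Researcher", "University"), ("Researcher", "Article"), ("Researcher", "Award")]
labels = {"uni1": "Stanford University", "prize1": "Turing Award"}
classes = {"uni1": "University", "prize1": "Award"}

space, elements = build_query_space(schema, labels, classes, ["stanford", "turing"])
print(elements)
print(sorted(space["University"]))
```

Only the matched resources are added on top of the schema, so the graph that is explored stays small regardless of how many instances the data graph contains.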
Top-k Query Graph Exploration on Query Space
Paths and their costs                                                                                The resulting query graph




     •       Cost-directed exploration of Steiner graphs
     •       Explore all possible distinct paths starting from keyword elements
     •       At each exploration, take current path with lowest cost
     •       When a connecting element is found, merge paths to construct the query
             graph and add it to candidate list
     •       Top-k terminates when the highest cost in the candidate list (the cost of the
             k-th ranked query graph) is found to be lower than the lowest possible cost that
             can be achieved with paths in the queues yet to be explored
     •       Result: best k query interpretations to be shown to the user
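
The exploration loop and its stopping rule can be sketched in Python as follows. This is a simplified reading of the bullets above and not the published algorithm: it keeps only the cheapest path from each keyword element to each graph element (rather than all distinct paths), scores a connecting element by the sum of those path costs, and returns connecting elements instead of full query graphs. The function name `top_k_connecting` and the example graph are assumptions.

```python
import heapq


def top_k_connecting(graph, keyword_elements, k):
    """graph: dict node -> list of (neighbour, edge_cost), edges listed in both directions."""
    n = len(keyword_elements)
    best = {}  # node -> {keyword index: cheapest path cost}
    heap = [(0.0, i, start) for i, start in enumerate(keyword_elements)]
    heapq.heapify(heap)
    candidates = []  # (total cost, connecting element)

    while heap:
        # Stop once the k-th candidate cannot be beaten: any new candidate
        # needs at least one unexplored path costing >= the cheapest queue entry.
        if len(candidates) >= k:
            kth_cost = sorted(candidates)[k - 1][0]
            if kth_cost <= heap[0][0]:
                break
        cost, i, node = heapq.heappop(heap)  # always expand the cheapest path
        reached = best.setdefault(node, {})
        if i in reached:
            continue  # a cheaper path from keyword i already reached this node
        reached[i] = cost
        if len(reached) == n:
            # Connecting element: reachable from every keyword element.
            candidates.append((sum(reached.values()), node))
        for neighbour, edge_cost in graph.get(node, []):
            if i not in best.get(neighbour, {}):
                heapq.heappush(heap, (cost + edge_cost, i, neighbour))

    return sorted(candidates)[:k]


graph = {
    "stanford": [("Researcher", 1)],
    "article": [("Researcher", 1)],
    "turing": [("Researcher", 2)],
    "Researcher": [("stanford", 1), ("article", 1), ("turing", 2)],
}
print(top_k_connecting(graph, ["stanford", "article", "turing"], k=1))
```

Because paths are expanded in order of increasing cost, the cheapest entry left in the queue is a lower bound for anything still undiscovered, which is what makes the early stop safe.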

Evaluation – Performance
     • Comparison with bidirectional search [V. Kacholia et al.] and
       search based on graph indexing (1000 BFS, 1000 METIS, 300
       BFS, 300 METIS in [H. He et al.])
     • Query computation + processing time until finding 10 answers
     • Outperforms bidirectional search by at least one order of magnitude
     • Performance comparable with indexing based approaches, but
       requires less space
     [Figure: Query Performance on DBLP Data. Log-scale chart (1 to 100000) over queries Q1 to Q10, comparing Our Solution, Bidirect, 1000 BFS, 1000 METIS, 300 BFS and 300 METIS.]
KEYWORD QUERY RESULT RANKING
           [CIKM11]
IR-based Ranking Schemes
      TF*IDF based:
                 Discover, EASE, SPARK
                 [Liu et al, SIGMOD06]

     Score(JRT) = \sum_{r \in JRT} Score(r)

     Score(r) = \sum_{v \in r, Q} Weight(v, r) \cdot Weight(v, Q)

     Weight(v, r) = \frac{ntf}{ndl} \cdot nidf

     ntf = 1 + \ln(1 + \ln(tf))

     ndl = (1 - s) + s \cdot \frac{dl}{avdl}

     nidf = \ln \frac{N + 1}{df}
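
A small Python sketch of this TF*IDF scheme may help to see how the normalizations interact. Several points are assumptions rather than facts from the slide: the numerator of nidf is read as N + 1, Weight(v, Q) is taken to be the plain term frequency in the query, the slope s defaults to 0.2, and all function names are illustrative.

```python
import math


def ntf(tf):
    """Normalized term frequency: 1 + ln(1 + ln(tf)), for tf >= 1."""
    return 1.0 + math.log(1.0 + math.log(tf))


def ndl(dl, avdl, s=0.2):
    """Pivoted length normalization: (1 - s) + s * dl / avdl."""
    return (1.0 - s) + s * dl / avdl


def nidf(n_docs, df):
    """Normalized inverse document frequency: ln((N + 1) / df)."""
    return math.log((n_docs + 1) / df)


def weight(tf, dl, avdl, n_docs, df, s=0.2):
    """Weight(v, r) = ntf / ndl * nidf."""
    return ntf(tf) / ndl(dl, avdl, s) * nidf(n_docs, df)


def score_tuple(term_freqs, query_freqs, dl, avdl, n_docs, doc_freqs, s=0.2):
    """Score(r): sum over the terms v that occur in both r and Q."""
    total = 0.0
    for term, query_tf in query_freqs.items():
        tf = term_freqs.get(term, 0)
        if tf > 0:
            total += weight(tf, dl, avdl, n_docs, doc_freqs[term], s) * query_tf
    return total


def score_jrt(tuple_scores):
    """Score(JRT): sum of the scores of the tuples r in the joined result."""
    return sum(tuple_scores)


r1 = score_tuple({"stanford": 1}, {"stanford": 1, "turing": 1},
                 dl=10, avdl=10, n_docs=9, doc_freqs={"stanford": 1, "turing": 3})
print(r1, score_jrt([r1, 0.5]))
```

With tf = 1 and dl = avdl both ntf and ndl equal 1, so the tuple score in the example reduces to nidf = ln(10).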


Proximity-based Ranking Schemes

      EASE, XRANK, BLINKS, etc.
      EASE
                 Proximity between a pair of keywords




                 Overall score of a JRT is aggregation on the score of keyword pairs
      XRANK
                 Ranking of XML documents / elements
                 Proximity here is defined based on w, the smallest text window in
                  n that contains all search keywords
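
The "smallest text window" used for proximity can be computed with a standard sliding-window scan. The Python sketch below shows only that building block on a token list; how XRANK turns the window size into a score is not covered here, and the function name `smallest_window` is an assumption.

```python
def smallest_window(tokens, keywords):
    """Return (start, end) token indices of the smallest window that
    contains every keyword, or None if no such window exists."""
    needed = set(keywords)
    if not needed:
        return None
    counts = {}
    covered = 0
    best = None
    left = 0
    for right, token in enumerate(tokens):
        if token in needed:
            counts[token] = counts.get(token, 0) + 1
            if counts[token] == 1:
                covered += 1
        # Shrink from the left while the window still covers all keywords.
        while covered == len(needed):
            if best is None or right - left < best[1] - best[0]:
                best = (left, right)
            dropped = tokens[left]
            if dropped in needed:
                counts[dropped] -= 1
                if counts[dropped] == 0:
                    covered -= 1
            left += 1
    return best


text = "bob shared an apartment in berlin with alice".split()
print(smallest_window(text, ["apartment", "alice"]))
```

A smaller window means the keywords occur closer together, which proximity-based schemes treat as evidence of a better match.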



Prestige-based Ranking Schemes

      Based on graph structure, i.e. PageRank-like
       methods to determine node prestige
                 XRank [Guo et al, SIGMOD03]
                 ObjectRank [Balmin et al, VLDB04] : considers both
                  global ObjectRank and keyword-specific ObjectRank
                 The probability that edges of different types will be
                  visited are not uniform: requires manual fine-tuning to
                  set the importance of different types of edges
                 Naive: indegree
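
To make the contrast between the naive and the PageRank-like variants concrete, here is a minimal Python sketch of both on a plain directed edge list. It implements textbook PageRank with uniform edge probabilities, not XRank or ObjectRank; the damping factor 0.85, the iteration count and the function names are assumptions.

```python
def indegree(edges):
    """Naive prestige: number of incoming edges per node."""
    counts = {}
    for source, target in edges:
        counts.setdefault(source, 0)
        counts[target] = counts.get(target, 0) + 1
    return counts


def pagerank(edges, damping=0.85, iterations=50):
    """Power iteration with uniform transition probabilities."""
    nodes = sorted({node for edge in edges for node in edge})
    outgoing = {node: [] for node in nodes}
    for source, target in edges:
        outgoing[source].append(target)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # Nodes without outgoing edges spread their rank over all nodes.
            targets = outgoing[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


edges = [("a", "c"), ("b", "c"), ("c", "a")]
print(indegree(edges))
print(pagerank(edges))
```

An ObjectRank-style variant would replace the uniform `share` with a weight that depends on the edge type, which is where the manual fine-tuning mentioned above comes in.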




Introduction
      A recent study shows that the effectiveness of most
       approaches is below expectations (Coffman and Weaver,
       CIKM 2010)
      Problems:
               Proximity does not directly model relevance
               Ad-hoc TF/IDF normalization does not capture the nature
                of keyword search results well (small document length,
                skewed word occurrence statistics)
               PageRank not directly applicable




Overview of the Approach

      Keyword query is short and ambiguous, while data
       (and results) provide rich structure information
       that can be exploited!
      Principled approach to relevance based on
       language models and pseudo-relevance feedback (PRF): estimate
       the model from content and structure of PRF results
      Adopt relevance model as a fine-grained model
       representing both content and structure of
       relevant document and queries (relevance class)


Relevance Models [SIGIR 01]
      Explicit notion of relevance
      Queries and documents are samples from a latent
       representation space, i.e. the relevance model underlying
       the information need




Relevance Models
     [Figure: a model M generates the query terms q1 = Israeli, q2 = Palestinian, q3 = raids and an unknown next word w = ???]

     Sample probabilities P(w|Q):
          w              P(w|Q)
          palestinian    .077
          israel         .055
          jerusalem      .034
          protest        .033
          raid           .027
          clash          .011
          bank           .010
          west           .010
          troop          .010
          …

     P(w | R) \approx P(w | q_1 ... q_k) = \frac{P(w, q_1 ... q_k)}{P(q_1 ... q_k)}

     P(w, q_1 ... q_k) = \sum_{M \in U_M} P(M) P(w | M) \prod_{i=1}^{k} P(q_i | M)
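
The estimate can be traced with a few lines of Python. The sketch assumes a uniform prior P(M) over a handful of feedback documents and uses unsmoothed word distributions as the models M; both choices, the function name `relevance_model` and the toy data are assumptions made for illustration.

```python
def relevance_model(query, doc_models, priors=None):
    """Estimate P(w | q1..qk) = P(w, q1..qk) / P(q1..qk).

    doc_models : dict model name -> {word: P(word | M)}, each summing to 1
    priors     : dict model name -> P(M); uniform if omitted
    """
    if priors is None:
        priors = {name: 1.0 / len(doc_models) for name in doc_models}
    joint = {}
    for name, model in doc_models.items():
        # Product over query terms of P(q_i | M).
        query_likelihood = 1.0
        for term in query:
            query_likelihood *= model.get(term, 0.0)
        # Add P(M) * P(w | M) * prod_i P(q_i | M) for every word w.
        for word, prob in model.items():
            joint[word] = joint.get(word, 0.0) + priors[name] * prob * query_likelihood
    # Since each model sums to 1, summing the joint over w gives P(q1..qk).
    evidence = sum(joint.values())
    if evidence == 0.0:
        return {}
    return {word: value / evidence for word, value in joint.items()}


models = {
    "d1": {"a": 0.5, "b": 0.5},
    "d2": {"b": 1.0},
}
print(relevance_model(["a"], models))
```

Only d1 can generate the query term "a", so the resulting distribution is d1's own word distribution; with more feedback documents the result is a mixture weighted by how well each one explains the query.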


Ranking with Relevance Models

      Probability ranking principle
          \frac{P(D | R)}{P(D | N)} \approx \prod_{w \in D} \frac{P(w | R)}{P(w | N)}

      See relevance model as query expansion
                 Rank of document is based on the cross-entropy of its
                  model and the relevance model

          H(R || D) = -\sum_{w \in V} P(w | R) \log P(w | D)

          P(w | D) = \lambda_D \frac{n(w, D)}{|D|} + (1 - \lambda_D) P(w | C)
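
Ranking by cross-entropy can be sketched in Python as below. The document model mixes the relative frequency n(w, D) / |D| with a collection model P(w | C); the mixing weight (called `lam` here, defaulting to 0.5), the function names and the toy data are assumptions, and documents are ranked by ascending cross-entropy.

```python
import math


def smoothed_prob(word, doc_counts, collection_model, lam=0.5):
    """P(w | D) = lam * n(w, D) / |D| + (1 - lam) * P(w | C)."""
    doc_length = sum(doc_counts.values())
    relative_freq = doc_counts.get(word, 0) / doc_length if doc_length else 0.0
    return lam * relative_freq + (1.0 - lam) * collection_model[word]


def cross_entropy(relevance, doc_counts, collection_model, lam=0.5):
    """H(R || D) = -sum_w P(w | R) * log P(w | D).

    Words with P(w | R) = 0 contribute nothing, so only the support of the
    relevance model is visited.
    """
    return -sum(
        prob * math.log(smoothed_prob(word, doc_counts, collection_model, lam))
        for word, prob in relevance.items()
        if prob > 0.0
    )


def rank_documents(relevance, docs, collection_model, lam=0.5):
    """Sort document ids from lowest to highest cross-entropy."""
    return sorted(
        docs,
        key=lambda doc_id: cross_entropy(relevance, docs[doc_id], collection_model, lam),
    )


relevance = {"a": 0.5, "b": 0.5}
collection = {"a": 0.25, "b": 0.25, "c": 0.5}
docs = {"d1": {"a": 1, "b": 1}, "d2": {"c": 2}}
print(rank_documents(relevance, docs, collection))
```

Smoothing with the collection model keeps P(w | D) above zero for words missing from a document, so the logarithm is always defined.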

Edge-Specific Relevance Models
            Given a query Q={q1,…,qn}, a set of PRF resources is retrieved from an inverted
             keyword index:
                       E.g. Q={Hepburn, Holiday}, FR = {m1, p1, p4, m2, p2m2, m3}
            Based on PRF results, an edge specific relevance model is constructed for each unique
             edge e based on:




Edge Specific Resource Models

      Edge-specific resource model:


                 Smoothing with model for the entire resource
      The score of a resource calculated based on cross-entropy
       of edge-specific RM and edge-specific ResM:




                 Alpha controls the importance of edges




Ranking JRTs
      Ranking aggregated JRTs:
                 The cross entropy between the edge-specific RM (Query Model) and
                  geometric mean of combined edge-specific ResM:




      The proposed ranking function is monotonic with respect to the
       individual resource scores (a necessary property for using top-k
       algorithms)


Experiments
      Datasets: Subsets of Wikipedia, IMDB and Mondial Web databases
      Queries: 50 queries for each dataset including “TREC style” queries and
       “single resource” queries
      Metrics: Three metrics are used: (1) the number of top-1 relevant
       results, (2) Reciprocal rank and (3) Mean Average Precision (MAP)
      Baselines: BANKS , Bidirectional (proximity) , Efficient , SPARK,
       CoveredDensity (TF-IDF).
      RM-S: Our approach
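
For reference, the three metrics can be computed as in the following Python sketch, which uses the standard textbook definitions. The function names and the sample ranking are assumptions; the evaluation framework used in the study is not reproduced here.

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0 if none is found."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision@k over the positions k holding a relevant result."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant items), one per query."""
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


def top1_relevant(runs):
    """Number of queries whose first result is relevant."""
    return sum(1 for ranked, relevant in runs if ranked and ranked[0] in relevant)


runs = [
    (["x", "a", "b"], {"a", "b"}),
    (["c", "y"], {"c"}),
]
print(top1_relevant(runs), mean_average_precision(runs))
```

Reciprocal rank rewards placing one correct answer early, which suits the "single resource" queries, while MAP also accounts for how many of the relevant results are found and where.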




Experiments – Single Resource Queries
     -       Proximity-based approaches perform well
     -       Minimizing compactness results in single resources being ranked high
     -       TF-IDF normalization not as aggressive, not as effective




                                          Reciprocal rank for single resource queries
Experiments – TREC-style Queries
     -       TF-IDF based approaches performed better
     -       Our approach outperformed existing approaches also in this category,
             providing more stable performance over the entire precision-recall curve




                                         Precision-recall for TREC-style queries on Wikipedia
Experiment – All Queries

     - Our approach consistently shows superior performance
     - Encouraging, given that this is the first study that uses a general
       framework for evaluating keyword search ranking




                                                                    MAP scores for all queries
Conclusions / Future Work

      Front-to-backend work on using structured data for
       enhancing the search experience
      From backend data management to frontend search
       concepts
      Current work / future directions
                 Managing hybrid data
                 Hybrid query processing / interfaces
                 Ranking hybrid results




References (1)
            Günter Ladwig, Thanh Tran
             SIHJoin: Querying Remote and Local Linked Data
             In 8th Extended Semantic Web Conference (ESWC'11). Heraklion, Greece, June, 2011 (full
             research paper, 23% acceptance rate).
            Thanh Tran, Lei Zhang, Rudi Studer
             Summary Models for Routing Keywords to Linked Data Sources
             In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai,
             China, November, 2010 (full research paper, 20% acceptance rate).
            Günter Ladwig, Thanh Tran
             Linked Data Query Processing Strategies
             In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai,
             China, November, 2010 (full research paper, 20% acceptance rate).
            Duc Thanh Tran, Philipp Cimiano, Sebastian Rudolph, Rudi Studer
             Ontology-based Interpretation of Keywords for Semantic Search
             In Proceedings of the 6th International Semantic Web Conference (ISWC'07), pp. 523-
             536. Busan, Korea, November 2007 (full paper, 19% acceptance rate).




References (2)
            Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano
             Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF
             In Proceedings of the 25th International Conference on Data Engineering
             (ICDE'09). Shanghai, China, March 2009 (full research paper, 17% acceptance rate).
            Haofen Wang, Duc Thanh Tran, Chang Liu
             CE2 - Towards a Large Scale Hybrid Search Engine with Integrated Ranking Support
             In Proceedings of the 17th Conference on Information and Knowledge Management
             (CIKM'08). Napa Valley, USA, October 2008 (poster paper, 16% acceptance rate).
            Haofen Wang, Qiaoling Liu, Thomas Penin, Linyun Fu, Lei Zhang, Thanh Tran, Yong Yu,
             Yue Pan
             Semplore: A Scalable IR Approach to Search the Web of Data
             In Journal of Web Semantics, 2009 (Impact Factor 3.4).
            Thomas Penin, Haofen Wang, Duc Thanh Tran, Yong Yu
             Snippet Generation for Semantic Web Search Engines
             In Proceedings of the 3rd Asian Semantic Web Conference (ASWC'08). December
             2008 (full research paper, 31% acceptance rate).
            Thanh Tran, Günter Ladwig
             Structure Index for RDF
             In SemData@VLDB Workshop (SemData'10). Singapore, September, 2010.

Thanks!




                                                                                                                            Tran Duc Thanh
                                                                                                                     ducthanh.tran@kit.edu
                                                                                                 http://sites.google.com/site/kimducthanh/


Backups




        Agrawal, S., Chaudhuri, S., and Das, G. (2002). DBXplorer: A system for keyword-based search
         over relational databases. In ICDE, pages 5-16.
        Amer-Yahia, S. and Shanmugasundaram, J. (2005). XML full-text search: Challenges and
         opportunities. In VLDB, page 1368.
        Bao, Z., Ling, T. W., Chen, B., and Lu, J. (2009). Effective xml keyword search with relevance
         oriented ranking. In ICDE, pages 517-528.
        Bhalotia, G., Nakhe, C., Hulgeri, A., Chakrabarti, S., and Sudarshan, S. (2002). Keyword Searching
         and Browsing in Databases using BANKS. In ICDE, pages 431-440.
        Bicer, V., Tran, T. (2011): Ranking Support for Keyword Search on Structured Data using
         Relevance Models. In CIKM.
        Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S. (2009):
         DBpedia - A crystallization point for the Web of Data. J. Web Sem. (WS) 7(3):154-165
        Dalvi, B. B., Kshirsagar, M., and Sudarshan, S. (2008). Keyword search on external memory data
         graphs. PVLDB, 1(1):1189-1204.
        Ding, B., Yu, J. X., Wang, S., Qin, L., Zhang, X., and Lin, X. (2007). Finding top-k min-cost
         connected trees in databases. In ICDE, pages 836-845.
        Golenberg, K., Kimelfeld, B., and Sagiv, Y. (2008). Keyword proximity search in complex data
         graphs. In SIGMOD, pages 927-940.
        Guo, L., Shao, F., Botev, C., and Shanmugasundaram, J. (2003). XRANK: Ranked keyword search
         over XML documents. In SIGMOD.
        He, H., Wang, H., Yang, J., and Yu, P. S. (2007). BLINKS: Ranked keyword searches on graphs. In
         SIGMOD, pages 305-316.
        Hristidis, V., Hwang, H., and Papakonstantinou, Y. (2008). Authority-based keyword search in
         databases. ACM Trans. Database Syst., 33(1):1-40

        Hristidis, V. and Papakonstantinou, Y. (2002). Discover: Keyword search in relational databases.
         In VLDB.
        Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., and Karambelkar, H. (2005).
         Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505-516.
        Kimelfeld, B. and Sagiv, Y. (2006). Finding and approximating top-k answers in keyword
         proximity search. In PODS, pages 173-182.
        Ladwig, G., Tran, T. (2011): Index Structures and Top-k Join Algorithms for Native Keyword
         Search Databases. In CIKM.
        Lavrenko, V., Croft, W. B. (2001): Relevance-Based Language Models. In SIGIR, pages 120-127.
        Li, G., Ooi, B. C., Feng, J., Wang, J., and Zhou, L. (2008). EASE: an effective 3-in-1 keyword search
         method for unstructured, semi-structured and structured data. In SIGMOD.
        Liu, F., Yu, C., Meng, W., and Chowdhury, A. (2006). Effective keyword search in relational
         databases. In SIGMOD, pages 563-574.
        Luo, Y., Lin, X., Wang, W., and Zhou, X. (2007). SPARK: Top-k keyword query in relational
         databases. In SIGMOD, pages 115-126.
        Qin, L., Yu J. X., Chang, L. (2009) Keyword search in databases: the power of RDBMS. In SIGMOD,
         pages 681-694.
        Sayyadian, M., LeKhac, H., Doan, A., and Gravano, L. (2007). Efficient keyword search across
         heterogeneous relational databases. In ICDE, pages 346-355.
        Tran, T., Herzig, D., Ladwig, G. (2011): SemSearchPro: Using Semantics throughout the Search
         Process. In Journal of Web Semantics, 2011.
        Tran, T., Wang, H., Rudolph, S., Cimiano, P. (2009): Top-k Exploration of Query Graph Candidates
         for Efficient Keyword Search on RDF. In ICDE.
        Hristidis, V., Gravano, L., and Papakonstantinou, Y. (2003). Efficient IR-style keyword search over
         relational databases. In VLDB.

Semantic Web Search - Searching Documents and Semantic Data on the Web

  • 1. Semantic Web Search: Searching Documents and Semantic Data on the Web. Presentation at Information Sciences Institute, USC. Semantic Search Group at the AIFB Institute: Thanh Tran, Günter Ladwig, Daniel M. Herzig, Andreas Wagner, Veli Bicer, Yongtao Ma and Rudi Studer. http://sites.google.com/site/kimducthanh
  • 2. Structure
      • Motivation
      • Previous and current work
      • Keyword query processing
      • Keyword query result ranking
      • Conclusion
  • 3. MOTIVATION: Besides documents, there is an increasing amount of structured data on the Web, such as RDF, RDFa and Linked Data. How can we leverage this to enhance the search experience?
  • 4. RDFa (adapted from http://www.w3.org/TR/xhtml-rdfa-primer/)
      …
      <div about="/alice/posts/trouble_with_bob">
          <h2 property="dc:title">The trouble with Bob</h2>
          <h3 property="dc:creator">Alice</h3>
          Bob is a good friend of mine. We went to the same university, and
          also shared an apartment in Berlin in 2008. The trouble with Bob is
          that he takes much better photos than I do:
          <div about="http://example.com/bob/photos/sunset.jpg">
              <img src="http://example.com/bob/photos/sunset.jpg" />
              <span property="dc:title">Beautiful Sunset</span>
              by <span property="dc:creator">Bob</span>.
          </div>
      </div>
      …
  • 5. RDFa, text content only: "Bob is a good friend of mine. We went to the same university, and also shared an apartment in Berlin in 2008. The trouble with Bob is that he takes much better photos than I do:" (adapted from http://www.w3.org/TR/xhtml-rdfa-primer/)
  • 6. Semantic Data (source: http://linkeddata.org/)
  • 7. Linked Data (adapted from http://www.w3.org/TR/xhtml-rdfa-primer/)
  • 8. Addressing Complex Information Needs
      • "Information about a friend of Alice, who shared an apartment with her in Berlin and knows someone in the field of Semantic Search working at KIT".
      • Constraints: <friend of Alice>, <shared apartment in Berlin with Alice>, <knows someone in the field of Semantic Search working at KIT>
      [Figure: data graph combining the text of the post "trouble with bob" with the nodes Alice, Bob, Peter, Thanh, sunset.jpg ("Beautiful Sunset"), Semantic Search, KIT, FluidOps, Germany, 2009 and 34]
  • 9. Data Sources in SemanticSearch@AIFB Demo
      • English Wikipedia
      • Data from Linked Open Data: DBpedia, YAGO and many more
      • Live data from Data.gov (US Government), e.g. live data about earthquakes
  • 10. Search Intent Interpretation, Refinement and Exploration [Screenshot labels: Keywords, Query Completions, Term Completions, Facets]
  • 11. Result Inspection, Analysis and Browsing
  • 12. OVERVIEW OF WORK
  • 13. Search Concepts
      • Hybrid search: structured queries combined with keywords on structured and unstructured data in possibly remote (Linked Data) sources (BACK-END)
      • Query interpretation: translation of keywords to hybrid queries
      • Keyword search (translated hybrid query) combined with faceted search: starting with keywords, followed by an iterative refinement process based on operations on facets (FRONT-END)
  • 14. Previous and Current Work
      • Semi-structured RDF data management [ISWC09] [TKDE12]
          • Inverted index for RDF data management
          • Structure index
      • Linked data management [ESWC10] [ISWC10] [ESWC11] [ISWC11]
          • Keyword query routing to find relevant sources / relevant combinations of sources
          • "Explorative" query processing and adaptive query optimization
          • Combining local and remote Linked Data
      • Search frontends [ICDE09] [CIKM11] [SIGIR11] [ISWC2011] [Dexa11]
          • Ontology and entity result summarization
          • Faceted and keyword search
      • Current work: hybrid data search
  • 15. KEYWORD QUERY PROCESSING [ICDE09]
  • 16. DB-style Keyword Search
      • Keyword query processing / translation: the information need "Articles of researchers at Stanford with Turing Award" is specified as the keyword query "Stanford Article Turing Award"
      • Keywords might produce a large number of matching elements in the data graph
      • The data graph might be large in size
      • Search complexity increases substantially with the size of the graph
      • Large number of results
      [Figure labels: Specification, Selection, Set of Queries (Query 1, Query 2), Set of Results (Result 1, Result 2)]
  • 17. Query Space
      • Main idea
          • Query space: more compact representation of the data graph
          • Substantially decrease complexity
          • Top-k procedure for graph exploration to compute only top-k results
      • Exploration on a much reduced summary model called query space
          • Online construction of query space out of schema graph
          • Match keywords against labels of resources to find keyword elements
          • Connect keyword elements with elements of schema to obtain query space
          • Online top-k query graph exploration
      [Figure: schema graph and query space]
  • 18. Top-k Query Graph Exploration on Query Space
      • Cost-directed exploration of Steiner graphs
      • Explore all possible distinct paths starting from keyword elements
      • At each exploration step, take the current path with the lowest cost
      • When a connecting element is found, merge paths to construct the query graph and add it to the candidate list
      • Top-k terminates when the highest cost in the candidate list (the cost of the k-ranked query graph) is found to be lower than the lowest possible cost that can be achieved with paths in the queues yet to be explored
      • Result: best k query interpretations to be shown to the user
      [Figure: paths and their costs; the resulting query graph]
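The exploration loop on slide 18 can be illustrated with a small Python sketch. It is a simplified stand-in, not the ICDE'09 algorithm: it assumes an undirected graph with non-negative edge costs, keeps only the cheapest path per keyword and node (the slide explores all distinct paths), and reports connecting elements with their total cost without deduplicating candidates that describe the same subgraph. The function name `top_k_connections` and the input layout are hypothetical.

```python
import heapq


def top_k_connections(graph, keyword_nodes, k):
    """Cost-directed exploration from several keyword elements (simplified).

    graph: {node: [(neighbor, cost), ...]} with non-negative costs.
    keyword_nodes: one start node per keyword.
    Returns up to k (total_cost, connecting_node) pairs, cheapest first.
    """
    # best[i][node] = cost of the cheapest path from keyword i to node
    best = [dict() for _ in keyword_nodes]
    # One shared queue, so the globally cheapest path is expanded first.
    heap = [(0, i, start) for i, start in enumerate(keyword_nodes)]
    heapq.heapify(heap)
    candidates = []

    while heap:
        # Termination test from the slide: the k-th best candidate is cheaper
        # than anything that can still be built from paths left in the queue.
        if len(candidates) >= k:
            kth_cost = sorted(candidates)[k - 1][0]
            if kth_cost < heap[0][0]:
                break
        cost, i, node = heapq.heappop(heap)
        if node in best[i]:
            continue  # a cheaper path from keyword i already reached this node
        best[i][node] = cost
        # A connecting element is a node reached from every keyword.
        if all(node in reached for reached in best):
            candidates.append((sum(reached[node] for reached in best), node))
        for neighbor, edge_cost in graph.get(node, []):
            if neighbor not in best[i]:
                heapq.heappush(heap, (cost + edge_cost, i, neighbor))

    return sorted(candidates)[:k]


# Hypothetical star-shaped query space: three keyword elements around "c".
star = {
    "c": [("k1", 1), ("k2", 2), ("k3", 3)],
    "k1": [("c", 1)],
    "k2": [("c", 2)],
    "k3": [("c", 3)],
}
best_connection = top_k_connections(star, ["k1", "k2", "k3"], 1)
```

Because all costs are non-negative and the queue is shared, any candidate found later must contain a path at least as expensive as the current queue head, which is what makes the early stop safe in this sketch.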
  • 19. Evaluation – Performance
      • Comparison with bidirectional search [V. Kacholia et al.] and search based on graph indexing (1000 BFS, 1000 METIS, 300 BFS, 300 METIS in [H. He et al.])
      • Measured: query computation + processing time until finding 10 answers
      • Outperforms bidirectional search by at least one order of magnitude
      • Performance comparable with indexing-based approaches, but requires less space
      [Figure: Query Performance on DBLP Data; queries Q1 to Q10, logarithmic axis from 1 to 100000; series: Our Solution, Bidirect, 1000 BFS, 1000 METIS, 300 BFS, 300 METIS]
  • 20. KEYWORD QUERY RESULT RANKING [CIKM11]
  • 21. IR-based Ranking Schemes
      • TF*IDF based: Discover, EASE, SPARK
      • [Liu et al, SIGMOD06]:
          $Score(JRT) = \sum_{r \in JRT} Score(r)$
          $Score(r) = \sum_{v \in r, Q} Weight(v, r) \cdot Weight(v, Q)$
          $Weight(v, r) = \frac{ntf}{ndl} \cdot nidf$
          $ntf = 1 + \ln(1 + \ln(tf))$
          $ndl = (1 - s) + s \cdot dl / avdl$
          $nidf = \ln \frac{N + 1}{df}$
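A minimal Python sketch of the weighting on slide 21, read as Weight(v, r) = (ntf / ndl) · nidf with ntf = 1 + ln(1 + ln(tf)), ndl = (1 − s) + s · dl / avdl and nidf = ln((N + 1) / df). The function names, the dictionary inputs and the default slope s = 0.2 are assumptions made for illustration; the slide does not give a value for s.

```python
import math


def ntf(tf):
    """Normalized term frequency; only defined for tf >= 1."""
    return 1 + math.log(1 + math.log(tf))


def ndl(dl, avdl, s=0.2):
    """Pivoted length normalization; s = 0.2 is an assumed default."""
    return (1 - s) + s * dl / avdl


def nidf(n_docs, df):
    """Normalized inverse document frequency."""
    return math.log((n_docs + 1) / df)


def score_resource(term_freqs, query_weights, dl, avdl, n_docs, doc_freqs, s=0.2):
    """Score(r): sum over terms v that occur in both the resource and the query."""
    total = 0.0
    for v, w_query in query_weights.items():
        tf = term_freqs.get(v, 0)
        if tf == 0:
            continue  # term does not occur in this resource
        w_resource = ntf(tf) / ndl(dl, avdl, s) * nidf(n_docs, doc_freqs[v])
        total += w_resource * w_query
    return total


def score_jrt(resource_scores):
    """Score(JRT): sum of the scores of the resources in the joined result."""
    return sum(resource_scores)


# Hypothetical numbers: one matching term, average-length resource.
example = score_resource(
    {"stanford": 1}, {"stanford": 1.0, "award": 1.0},
    dl=10, avdl=10, n_docs=99, doc_freqs={"stanford": 10, "award": 5},
)
```

With tf = 1 and dl = avdl, both ntf and ndl equal 1, so the example score reduces to nidf = ln(100 / 10).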
  • 22. Proximity-based Ranking Schemes
      • EASE, XRANK, BLINKS, etc.
      • EASE
          • Proximity between a pair of keywords
          • The overall score of a JRT is an aggregation over the scores of keyword pairs
      • XRANK
          • Ranking of XML documents / elements
          • Proximity here is defined based on w, the smallest text window in n that contains all search keywords
  • 23. Prestige-based Ranking Schemes
      • Based on graph structure, i.e. PageRank-like methods to determine node prestige
      • XRank [Guo et al, SIGMOD03]
      • ObjectRank [Balmin et al, VLDB04]: considers both global ObjectRank and keyword-specific ObjectRank
      • The probability that edges of different types will be visited is not uniform: manual fine-tuning is required to set the importance of different types of edges
      • Naive: indegree
  • 24. Introduction
      • A recent study shows that the effectiveness of most approaches is below expectations (Coffman and Weaver, CIKM 2010)
      • Problems:
          • Proximity does not directly model relevance
          • Ad-hoc TF/IDF normalization does not capture the nature of keyword search results well (small document length, skewed word occurrence statistics)
          • PageRank is not directly applicable
  • 25. Overview of the Approach
      • Keyword queries are short and ambiguous, while data (and results) provide rich structure information that can be exploited
      • Principled approach to relevance based on language models and pseudo-relevance feedback (PRF): estimate the model from the content and structure of PRF results
      • Adopt the relevance model as a fine-grained model representing both content and structure of relevant documents and queries (relevance class)
  • 26. Relevance Models [SIGIR 01]
      • Explicit notion of relevance
      • Queries and documents are samples from a latent representation space, i.e. the relevance model underlying the information need
  • 27. Relevance Models
      [Figure: query terms q1 = Israeli, q2 = Palestinian, q3 = raids and an unknown word w (???), each sampled from a model M]
      • Sample probabilities P(w|Q): palestinian .077, israel .055, jerusalem .034, protest .033, raid .027, clash .011, bank .010, west .010, troop .010, …
      • $P(w | R) \approx P(w | q_1 \ldots q_k) = \frac{P(w, q_1 \ldots q_k)}{P(q_1 \ldots q_k)}$
      • $P(w, q_1 \ldots q_k) = \sum_{M \in U_M} P(M) \, P(w | M) \prod_{i=1}^{k} P(q_i | M)$
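The estimate on slide 27 can be sketched in a few lines of Python, reading the formulas as P(w, q1…qk) = Σ_M P(M) P(w|M) Π_i P(qi|M) and P(w|R) ≈ P(w, q1…qk) / P(q1…qk). The function name, the dictionary layout of the models and the uniform prior P(M) are assumptions for illustration; the toy models and their probabilities are invented and are not the numbers from the slide.

```python
def relevance_model(query, doc_models, priors=None):
    """Estimate P(w | R) from a set of models M.

    query: list of query terms q1..qk.
    doc_models: {model_id: {word: P(word | M)}}, each summing to 1.
    priors: {model_id: P(M)}; uniform if omitted (an assumption).
    """
    if priors is None:
        priors = {m: 1.0 / len(doc_models) for m in doc_models}
    vocab = set().union(*doc_models.values())

    joint = {}
    for w in vocab:
        total = 0.0
        for m, model in doc_models.items():
            query_likelihood = 1.0
            for q in query:
                query_likelihood *= model.get(q, 0.0)
            total += priors[m] * model.get(w, 0.0) * query_likelihood
        joint[w] = total  # P(w, q1..qk)

    # Summing the joint over all w gives P(q1..qk), because every model's
    # word probabilities sum to 1.
    norm = sum(joint.values())
    if norm == 0:
        return joint
    return {w: p / norm for w, p in joint.items()}


# Invented toy models over a three-word vocabulary.
models = {
    "M1": {"raid": 0.5, "israel": 0.5},
    "M2": {"raid": 0.1, "bank": 0.9},
}
rm = relevance_model(["raid"], models)
```

In a real system the models would be smoothed so that unseen query terms do not zero out a model's contribution; the sketch leaves that out to stay close to the formula.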
  • 28. Ranking with Relevance Models
      • Probability ranking principle: $\frac{P(D | R)}{P(D | N)} \approx \prod_{w \in D} \frac{P(w | R)}{P(w | N)}$
      • The relevance model can be seen as query expansion
      • The rank of a document is based on the cross-entropy of its model and the relevance model: $H(R \| D) = -\sum_{w \in V} P(w | R) \log P(w | D)$
      • Smoothed document model: $P(w | D) = \lambda_D \frac{n(w, D)}{|D|} + (1 - \lambda_D) P(w | C)$
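As a rough illustration of slide 28, the sketch below ranks documents by the cross-entropy H(R || D) = −Σ_w P(w|R) log P(w|D), using the smoothed document model P(w|D) = λ_D · n(w, D) / |D| + (1 − λ_D) · P(w|C). The function names, the fixed λ = 0.8 and the toy counts are assumptions; the collection model must give every word of the relevance model a positive probability for the logarithm to be defined.

```python
import math


def smoothed_prob(w, doc_counts, collection_probs, lam=0.8):
    """P(w | D) with linear interpolation; lam = 0.8 is an assumed value."""
    doc_len = sum(doc_counts.values())
    max_likelihood = doc_counts.get(w, 0) / doc_len if doc_len else 0.0
    return lam * max_likelihood + (1 - lam) * collection_probs[w]


def cross_entropy(rel_model, doc_counts, collection_probs, lam=0.8):
    """H(R || D): lower values mean the document is closer to the relevance model."""
    return -sum(
        p * math.log(smoothed_prob(w, doc_counts, collection_probs, lam))
        for w, p in rel_model.items()
        if p > 0
    )


def rank(rel_model, docs, collection_probs, lam=0.8):
    """Return document ids ordered from lowest to highest cross-entropy."""
    return sorted(
        docs,
        key=lambda d: cross_entropy(rel_model, docs[d], collection_probs, lam),
    )


# Invented toy data.
rel_model = {"raid": 0.7, "bank": 0.3}
collection = {"raid": 0.5, "bank": 0.5}
docs = {"d1": {"raid": 3, "bank": 1}, "d2": {"bank": 4}}
ordering = rank(rel_model, docs, collection)
```

Document d1 ranks first here because its smoothed model (0.7 for "raid", 0.3 for "bank") coincides with the relevance model, which minimizes the cross-entropy.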
  • 29. Edge-Specific Relevance Models
      • Given a query Q = {q1, …, qn}, a set of PRF resources is retrieved from an inverted keyword index
      • E.g. Q = {Hepburn, Holiday}, FR = {m1, p1, p4, m2, p2m2, m3}
      • Based on the PRF results, an edge-specific relevance model is constructed for each unique edge e based on:
  • 30. Edge-Specific Resource Models
      • Edge-specific resource model:
      • Smoothing with the model for the entire resource
      • The score of a resource is calculated based on the cross-entropy of the edge-specific RM and the edge-specific ResM:
      • Alpha controls the importance of edges
  • 31. Ranking JRTs
      • Ranking aggregated JRTs:
      • The cross-entropy between the edge-specific RM (query model) and the geometric mean of the combined edge-specific ResM:
      • The proposed ranking function is monotonic with respect to the individual resource scores (a necessary property for using top-k algorithms)
  • 32. Experiments
      • Datasets: subsets of Wikipedia, IMDB and Mondial Web databases
      • Queries: 50 queries for each dataset, including "TREC style" queries and "single resource" queries
      • Metrics: (1) the number of top-1 relevant results, (2) reciprocal rank and (3) Mean Average Precision (MAP)
      • Baselines: BANKS, Bidirectional (proximity), Efficient, SPARK, CoveredDensity (TF-IDF)
      • RM-S: our approach
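For readers unfamiliar with the metrics on slide 32, here is a short Python sketch of reciprocal rank and (mean) average precision using their standard definitions. The function names and the toy ranking are illustrative only and are not taken from the evaluation framework used in the experiments.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where relevant results occur."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_results, relevant_set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy ranking: relevant results at ranks 2 and 4.
ranking = ["a", "b", "c", "d"]
relevant = {"b", "d"}
rr = reciprocal_rank(ranking, relevant)
ap = average_precision(ranking, relevant)
```

In the toy ranking the first relevant result sits at rank 2, so the reciprocal rank is 1/2; the precision values at ranks 2 and 4 are 1/2 and 2/4, so the average precision is also 1/2.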
  • 33. Experiments – Single Resource Queries
      • Proximity-based approaches perform well
      • Minimizing compactness results in single resources being ranked high
      • TF-IDF normalization is not as aggressive and not as effective
      [Figure: reciprocal rank for single resource queries]
  • 34. Experiments – TREC-style Queries
      • TF-IDF based approaches performed better
      • Our approach outperformed existing approaches in this category as well, providing more stable performance over the entire precision-recall curve
      [Figure: precision-recall for TREC-style queries on Wikipedia]
  • 35. Experiment – All Queries
      • Our approach consistently shows superior performance
      • Encouraging, given that this is the first study that uses a general framework for evaluating keyword search ranking
      [Figure: MAP scores for all queries]
  • 36. Conclusions / Future Work
      • Front-to-backend work on using structured data for enhancing the search experience
      • From backend data management to frontend search concepts
      • Current work / future directions
          • Managing hybrid data
          • Hybrid query processing / interfaces
          • Ranking hybrid results
  • 37. References (1)
      • Günter Ladwig, Thanh Tran. SIHJoin: Querying Remote and Local Linked Data. In 8th Extended Semantic Web Conference (ESWC'11). Heraklion, Greece, June 2011 (full research paper, 23% acceptance rate).
      • Thanh Tran, Lei Zhang, Rudi Studer. Summary Models for Routing Keywords to Linked Data Sources. In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai, China, November 2010 (full research paper, 20% acceptance rate).
      • Günter Ladwig, Thanh Tran. Linked Data Query Processing Strategies. In Proceedings of 9th International Semantic Web Conference (ISWC'10). Shanghai, China, November 2010 (full research paper, 20% acceptance rate).
      • Duc Thanh Tran, Philipp Cimiano, Sebastian Rudolph, Rudi Studer. Ontology-based Interpretation of Keywords for Semantic Search. In Proceedings of the 6th International Semantic Web Conference (ISWC'07), pp. 523-536. Busan, Korea, November 2007 (full paper, 19% acceptance rate).
  • 38. References (2)
      • Duc Thanh Tran, Haofen Wang, Sebastian Rudolph, Philipp Cimiano. Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF. In Proceedings of the 25th International Conference on Data Engineering (ICDE'09). Shanghai, China, March 2009 (full research paper, 17% acceptance rate).
      • Haofen Wang, Duc Thanh Tran, Chang Liu. CE2 - Towards a Large Scale Hybrid Search Engine with Integrated Ranking Support. In Proceedings of the 17th Conference on Information and Knowledge Management (CIKM'08). Napa Valley, USA, October 2008 (poster paper, 16% acceptance rate).
      • Haofen Wang, Qiaoling Liu, Thomas Penin, Linyun Fu, Lei Zhang, Thanh Tran, Yong Yu, Yue Pan. Semplore: A Scalable IR Approach to Search the Web of Data. In Journal of Web Semantics, 2009 (Impact Factor 3.4).
      • Thomas Penin, Haofen Wang, Duc Thanh Tran, Yong Yu. Snippet Generation for Semantic Web Search Engines. In Proceedings of the 3rd Asian Semantic Web Conference (ASWC'08). December 2008 (full research paper, 31% acceptance rate).
      • Thanh Tran, Günter Ladwig. Structure Index for RDF. In SemData@VLDB Workshop (SemData'10). Singapore, September 2010.
  • 39. Thanks! Tran Duc Thanh, ducthanh.tran@kit.edu, http://sites.google.com/site/kimducthanh/
  • 40. Backups
  • 41. References
      • Agrawal, S., Chaudhuri, S., and Das, G. (2002). DBXplorer: A system for keyword-based search over relational databases. In ICDE, pages 5-16.
      • Amer-Yahia, S. and Shanmugasundaram, J. (2005). XML full-text search: Challenges and opportunities. In VLDB, page 1368.
      • Bao, Z., Ling, T. W., Chen, B., and Lu, J. (2009). Effective XML keyword search with relevance oriented ranking. In ICDE, pages 517-528.
      • Bhalotia, G., Nakhe, C., Hulgeri, A., Chakrabarti, S., and Sudarshan, S. (2002). Keyword Searching and Browsing in Databases using BANKS. In ICDE, pages 431-440.
      • Bicer, V. and Tran, T. (2011). Ranking Support for Keyword Search on Structured Data using Relevance Models. In CIKM.
      • Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., and Hellmann, S. (2009). DBpedia - A crystallization point for the Web of Data. J. Web Sem., 7(3):154-165.
      • Dalvi, B. B., Kshirsagar, M., and Sudarshan, S. (2008). Keyword search on external memory data graphs. PVLDB, 1(1):1189-1204.
      • Ding, B., Yu, J. X., Wang, S., Qin, L., Zhang, X., and Lin, X. (2007). Finding top-k min-cost connected trees in databases. In ICDE, pages 836-845.
      • Golenberg, K., Kimelfeld, B., and Sagiv, Y. (2008). Keyword proximity search in complex data graphs. In SIGMOD, pages 927-940.
      • Guo, L., Shao, F., Botev, C., and Shanmugasundaram, J. (2003). XRANK: Ranked keyword search over XML documents. In SIGMOD.
      • He, H., Wang, H., Yang, J., and Yu, P. S. (2007). BLINKS: Ranked keyword searches on graphs. In SIGMOD, pages 305-316.
      • Hristidis, V., Hwang, H., and Papakonstantinou, Y. (2008). Authority-based keyword search in databases. ACM Trans. Database Syst., 33(1):1-40.
  • 42. References (continued)
      • Hristidis, V. and Papakonstantinou, Y. (2002). Discover: Keyword search in relational databases. In VLDB.
      • Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., and Karambelkar, H. (2005). Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505-516.
      • Kimelfeld, B. and Sagiv, Y. (2006). Finding and approximating top-k answers in keyword proximity search. In PODS, pages 173-182.
      • Ladwig, G. and Tran, T. (2011). Index Structures and Top-k Join Algorithms for Native Keyword Search Databases. In CIKM.
      • Lavrenko, V. and Croft, W. B. (2001). Relevance-Based Language Models. In SIGIR, pages 120-127.
      • Li, G., Ooi, B. C., Feng, J., Wang, J., and Zhou, L. (2008). EASE: an effective 3-in-1 keyword search method for unstructured, semi-structured and structured data. In SIGMOD.
      • Liu, F., Yu, C., Meng, W., and Chowdhury, A. (2006). Effective keyword search in relational databases. In SIGMOD, pages 563-574.
      • Luo, Y., Lin, X., Wang, W., and Zhou, X. (2007). SPARK: Top-k keyword query in relational databases. In SIGMOD, pages 115-126.
      • Qin, L., Yu, J. X., and Chang, L. (2009). Keyword search in databases: the power of RDBMS. In SIGMOD, pages 681-694.
      • Sayyadian, M., LeKhac, H., Doan, A., and Gravano, L. (2007). Efficient keyword search across heterogeneous relational databases. In ICDE, pages 346-355.
      • Tran, T., Herzig, D., and Ladwig, G. (2011). SemSearchPro: Using Semantics throughout the Search Process. Journal of Web Semantics.
      • Tran, T., Wang, H., Rudolph, S., and Cimiano, P. (2009). Top-k Exploration of Query Graph Candidates for Efficient Keyword Search on RDF. In ICDE.
      • Hristidis, V., Gravano, L., and Papakonstantinou, Y. (2003). Efficient IR-style keyword search over relational databases. In VLDB.

Editor's Notes

  1. Web data: text + Linked Data + semi-structured RDF + hybrid data, all of which can be conceived as forming data graphs. We hear about Bob and Alice all the time (in the computer science literature) and want to find out more… so we build a Semantic Web search engine to address complex information needs by exploiting Web data: the information need is interpreted as a set of constraints; match structured data; match text.
  2. To give an impression of where we are in accomplishing this goal: demo first. Our current system supports the process of addressing complex information needs: it starts with keyword search, interpreting the query intent, followed by browsing / exploration / refinement of the result set via faceted search.
  3. Upon selecting a specific result: resource-based navigation (instead of facet-based).
  4. TF-IDF is used to deal with the textual part of the data. Proposal: also exploit the structure of keyword search results. Proximity-based ranking employs minimal-distance heuristics to maximize the structural compactness of results: when a JRT is more compact, it is assumed to be more meaningful and relevant. Intuition: keywords specified by the user are closely related and thus should be connected over relatively short paths, i.e. compactness is measured in terms of the length of paths between nodes (the proximity). The longer the paths, the less relevant the overall result. n_i and n_j are nodes in the graph; sim(n_i, n_j) denotes the compactness between any two nodes; sim(k_i, k_j) denotes the compactness between two keywords (taking into account the compactness of all pairs of nodes matching the two keywords), where C_ki denotes the set of all nodes that match k_i. Overall score of a JRT is an aggregation on the score of its
  5. Schemas = summaries