LDBC Semantic Publishing Benchmark evolution. The Semantic Publishing Benchmark (SPB) needed to evolve in order to allow retrieval of semantically relevant content based on rich metadata descriptions and diverse reference knowledge, and to demonstrate that triplestores offer simpler and more efficient querying, making them a competitive technology in the RDF and graph arena. The presentations cover all the changes.
The document discusses the key features and capabilities of search in SharePoint 2013, including personalized search results, continuous crawling, query rules to customize search results, managed metadata and refiners to improve relevance, and analytics to track search usage and improve recommendations. It provides details on result sources, query rules, display templates, and various analytics features to enhance the search experience.
Entity Search on Virtual Documents Created with Graph Embeddings (Sease)
The document summarizes Anna Ruggero's presentation on entity search using graph embeddings. The presentation introduced an approach to entity search that creates virtual documents combining related entities based on their embeddings. Entities from DBPedia were represented as nodes in a graph and Node2Vec was used to generate embeddings. The embeddings were clustered to form documents, which were ranked using BM25. Systems that combined or fused the document rankings and entity rankings were evaluated on entity search tasks and found relevant entities that traditional methods did not.
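The ranking step described above (BM25 over "virtual documents" built from embedding clusters) can be sketched in plain Python. The toy corpus, entity labels, and parameter values below are illustrative assumptions for the sketch, not the presentation's actual setup.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document against a query with a standard BM25 formula."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len))
        score += idf * norm
    return score

# Hypothetical "virtual documents": each is a bag of entity labels
# drawn from one embedding cluster.
virtual_docs = [
    ["einstein", "physicist", "relativity", "nobel"],
    ["bohr", "physicist", "quantum", "nobel"],
    ["picasso", "painter", "cubism"],
]
query = ["physicist", "nobel"]
ranked = sorted(range(len(virtual_docs)),
                key=lambda i: bm25_score(query, virtual_docs[i], virtual_docs),
                reverse=True)
```

In the actual approach the entities come from DBPedia and the clusters from Node2Vec embeddings; the sketch only shows how the clustered documents would then be scored and ranked.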
Search Quality Evaluation to Help Reproducibility: An Open-source Approach (Alessandro Benedetti)
Every information retrieval practitioner ordinarily struggles with the task of evaluating how well a search engine is performing and with reproducing the performance achieved at a specific point in time.
Improving the correctness and effectiveness of a search system requires a set of tools that help measure the direction in which the system is going.
Additionally, it is extremely important to track the evolution of the search system over time and to be able to reproduce and measure the same performance (through metrics of interest such as precision@k, recall, NDCG@k, and so on).
The talk will describe the Rated Ranking Evaluator from a researcher's and software engineer's perspective.
RRE is an open source search quality evaluation tool that can be used to produce a set of reports about the quality of a system, iteration after iteration, and that can be integrated within a continuous integration infrastructure to monitor quality metrics after each release.
The focus of the talk will be to raise public awareness of search quality evaluation and reproducibility, describing how RRE could help the industry.
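The metrics named above (precision@k, NDCG@k) can be computed from a ranked result list and a set of rated judgements in a few lines of Python. The document ids and ratings below are hypothetical, and this is a minimal sketch of the standard metric definitions, not RRE's implementation.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are rated relevant."""
    return sum(1 for doc in ranked_ids[:k] if doc in relevant_ids) / k

def ndcg_at_k(ranked_ids, ratings, k):
    """NDCG@k with graded ratings (missing ids count as 0 = not relevant)."""
    dcg = sum(ratings.get(doc, 0) / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]))
    ideal = sorted(ratings.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical judgements (doc id -> graded rating) and one ranked list.
ratings = {"d1": 3, "d2": 2, "d5": 1}
ranked = ["d1", "d4", "d2", "d3", "d5"]
p5 = precision_at_k(ranked, set(ratings), 5)
n5 = ndcg_at_k(ranked, ratings, 5)
```

Tracking these numbers per iteration, as the abstract describes, is what makes regressions visible across releases.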
The document discusses the new architecture for search in SharePoint 2013. It introduces the main components of the 2013 search architecture including the crawl component, content processing component, analytics processing component, index component, query processing component, and admin component. It provides an overview of the key changes and improvements in 2013 such as continuous crawl, redundancy of the admin component, and improved management using PowerShell. The document also covers upgrade considerations and limitations when migrating from SharePoint 2010 search.
Every team working on information retrieval software struggles with the task of evaluating how well their system performs in terms of search quality (currently and historically). Evaluating search quality is important both to understand and size the improvement or regression of your search application across the development cycles, and to communicate such progress to relevant stakeholders. In the industry, and especially in the open source community, the landscape is quite fragmented: such requirements are often achieved using ad-hoc partial solutions that each time require a considerable amount of development and customization effort. To provide a standard, unified and approachable technology, we developed the Rated Ranking Evaluator (RRE), an open source tool for evaluating and measuring the search quality of a given search infrastructure. RRE is modular, compatible with multiple search technologies and easy to extend.
COUNTER R4 to R5 - transition and comparison with JUSP - updated (JUSPSTATS)
This document outlines a session on the transition from COUNTER R4 to R5 usage reporting standards and the new reports available in JUSP. It discusses JUSP transitioning their reports to R5, the challenges libraries face in understanding and working with R5 data, new report features in JUSP to help with the transition, such as comparing R4 and R5 metrics, and where to find training and support resources.
Rated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation (Alessandro Benedetti)
Every team working on Information Retrieval software struggles with the task of evaluating how well their system performs in terms of search quality (at a specific point in time and historically).
Evaluating search quality is important both to understand and size the improvement or regression of your search application across the development cycles, and to communicate such progress to relevant stakeholders.
To satisfy these requirements, a helpful tool must be:
flexible and highly configurable for a technical user
immediate, visual, and concise for effective business use
In the industry, and especially in the open source community, the landscape is quite fragmented: such requirements are often achieved using ad-hoc partial solutions that each time require a considerable amount of development and customization effort.
To provide a standard, unified and approachable technology, we developed the Rated Ranking Evaluator (RRE), an open source tool for evaluating and measuring the search quality of a given search infrastructure. RRE is modular, compatible with multiple search technologies and easy to extend. It is composed of a core library and a set of modules and plugins that give it the flexibility to be integrated into automated evaluation processes and continuous integration flows.
This talk will introduce RRE, describe its latest developments, and demonstrate how it can be integrated into a project to measure and assess the search quality of your search application.
The focus of the presentation will be a live demo showing an example project with a set of initial relevancy issues that we solve iteration after iteration, using RRE's output to gradually drive the improvement process until we reach an optimal balance across the quality evaluation measures.
19 09-26 source reconfiguration and august release features (Dede Sabey)
This document summarizes a meeting agenda and presentation on Pardot's August release features.
The agenda includes discussions on populating Pardot's source field, August release features in depth, learning and development, and Q&A. The presentation covers Pardot's new Handlebars merge language, automated resubscribe emails, Engagement Studio complex rules, and multiple tracker domains. It also provides guidance on setting up the new features and finding related help articles and resources.
The Geographic Support System Initiative aims to improve the Census Bureau's Master Address File and spatial data through increased funding and partnerships. It involves continuous updates to address and map data, with a focus on rural areas and Puerto Rico. Major players are the Census Bureau and its government and private sector partners. Challenges include the lack of national addressing standards and constraints on sharing address data due to Title 13 confidentiality rules. Ongoing work includes research, evaluation of current data and standards, and leveraging technology and partnerships to improve data quality and exchange.
The document provides an overview of the architecture for a developer analytics solution. It includes:
1. An overview of the key components and data flows, including extracting and aggregating data from various sources and loading it into data warehouses and cubes.
2. A demo of the developer portal and reports.
3. Details on the front-end architecture with multiple active-active front-end servers and an active-passive backend architecture with data warehouses and cubes.
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac... (Databricks)
This document discusses property graphs and how they are represented and queried using Morpheus, a graph query engine for Apache Spark.
Morpheus allows querying property graphs using Cypher and represents property graphs using DataFrames, with node and relationship data stored in tables. It integrates with various data sources and supports federated queries across multiple property graphs. The document provides examples of loading property graph data from sources like JSON, SQL databases and Neo4j, creating graph projections, running analytical queries, and recommending businesses based on graph algorithms.
A two-day training session for colleagues at Aimia, to introduce them to R. Topics covered included the basics of R, I/O with R, data analysis and manipulation, and visualisation.
Search Quality Evaluation: a Developer Perspective (Andrea Gazzarini)
Search quality evaluation is an evergreen topic every search engineer ordinarily struggles with. Improving the correctness and effectiveness of a search system requires a set of tools that help measure the direction in which the system is going.
The slides will focus on how a search quality evaluation tool can be seen from a practical developer's perspective, how it could be used to produce a deliverable artifact, and how it could be integrated within a continuous integration infrastructure.
Graph Features in Spark 3.0: Integrating Graph Querying and Algorithms in Spa... (Databricks)
This document discusses how property graphs and Cypher queries will be brought to Apache Spark through the Spark Graph project. It provides an overview of property graphs and Cypher, the graph query language. It then demonstrates how a Cypher query would be executed in Spark Graph, including parsing and translating the query, optimizing the logical and physical plans, and executing it using Spark SQL and DataFrames. This will allow graph queries and algorithms to be run efficiently at scale using Spark's distributed processing capabilities.
The document discusses the UniProt SPARQL endpoint, which provides access to 20 billion biomedical data triples. It notes challenges in hosting large life science databases and standards-based federated querying. The endpoint uses Apache load balancing across two nodes each with 64 CPU cores and 256GB RAM. Loading rates of 500,000 triples/second are achieved. Usage peaks at 35 million queries per month from an estimated 300-2000 real users. Very large queries involving counting all IRIs can take hours. Template-based compilation and key-value approaches are proposed to optimize challenging queries. Public monitoring is also discussed.
Lighthouse: Large-scale graph pattern matching on Giraph (LDBC council)
This document provides an overview of Lighthouse, a system for large-scale graph pattern matching on Giraph. It discusses the timeline of Lighthouse's development, its vertex-centric API, architecture based on BSP and Pregel, and execution algebra using operations like scan, select, project, and joins. Examples are provided of matching graph patterns in Cypher queries and finding the k-shortest paths between nodes while restricting the search space. The implementation uses two phases, first computing the routes for each top k shortest path by discovering precedent vertices, then building the full paths by traversing backwards. Preliminary results are mentioned but not detailed.
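The two-phase idea described above, first recording a predecessor for each vertex and then rebuilding full paths by traversing backwards, can be illustrated on a single machine with plain BFS. This is a minimal sketch of the predecessor/backward-traversal pattern, not Lighthouse's vertex-centric Giraph implementation, and the example graph is hypothetical.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Phase 1: BFS records each vertex's predecessor on a shortest route.
    Phase 2: the full path is rebuilt by walking predecessors backwards."""
    pred = {source: None}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            break
        for nxt in graph.get(v, []):
            if nxt not in pred:
                pred[nxt] = v
                queue.append(nxt)
    if target not in pred:
        return None  # target unreachable
    path, v = [], target
    while v is not None:  # backward traversal over predecessors
        path.append(v)
        v = pred[v]
    return path[::-1]

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
```

In the distributed setting the same two phases run as Pregel supersteps, with each vertex storing its predecessors for the top-k routes rather than a single one.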
LDBC 6th TUC Meeting conclusions by Peter Boncz (Ioan Toma)
The document discusses the conclusions and future perspectives of the Linked Data Benchmark Council (LDBC). It summarizes that the EU project is on track, four benchmarks were developed including for social networks and graph analytics, and there has been growing participation from academic and industry groups. It outlines plans for LDBC to transition to an independent council and develop additional benchmarks in areas like graph programming, temporal graphs, and ontology reasoning.
GRAPH-TA 2013 - RDF and Graph benchmarking - Jose Lluis Larriba Pey (Ioan Toma)
The document discusses an agenda for a meeting on benchmarking RDF and graph databases. It provides an overview of the LDBC benchmarking project including its objectives to create standardized benchmarks, spur industry cooperation, and push technological improvements. It outlines the working groups and task forces that will focus on specific types of benchmarks. It also discusses common issues to be addressed like suitable use cases, benchmark methodologies, and benchmark workloads for querying, updating, and integrating RDF and graph data. Open questions are raised about benchmark realism, benchmark rules, and technological convergence of RDF and graph databases.
SADI: A design-pattern for “native” Linked-Data Semantic Web Services (Ioan Toma)
Semantic Automated Discovery and Integration
A design-pattern for “native” Linked-Data
Semantic Web Services by Mark D. Wilkinson (Universidad Politécnica de Madrid)
The LDBC benchmark council develops industry-strength benchmarks for graph and RDF data management systems. It focuses on benchmarks for transactional updates, business intelligence queries, graph analytics, and complex RDF workloads. The benchmarks are designed based on choke points - difficulties in the workload that stimulate technical progress. Examples of choke points in TPC-H include dependent group by keys and sparse joins. A key choke point for graph benchmarks is structural correlation, where the structure of the graph depends on node attribute values. The LDBC benchmark includes a generator that produces synthetic graphs with correlated properties and edges to test how systems handle this challenge.
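The structural-correlation choke point described above can be mimicked with a toy generator in which edge probability depends on whether two nodes share an attribute value. The attribute model and probabilities below are illustrative assumptions for the sketch; the real LDBC data generator is far more elaborate.

```python
import random

def correlated_graph(n, p_same=0.3, p_diff=0.02, seed=42):
    """Assign each node an attribute (e.g. a location); make edges far
    more likely between nodes sharing it, so that the graph's structure
    correlates with node attribute values."""
    rng = random.Random(seed)
    attrs = [rng.choice(["uk", "de", "es"]) for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p = p_same if attrs[i] == attrs[j] else p_diff
            if rng.random() < p:
                edges.append((i, j))
    return attrs, edges

attrs, edges = correlated_graph(200)
# Most edges end up between same-attribute nodes, which is exactly the
# structure/value correlation the benchmark uses to stress systems.
same = sum(1 for i, j in edges if attrs[i] == attrs[j])
```

A query optimizer that assumes attribute values and edges are independent will badly misestimate cardinalities on data like this, which is the point of the choke point.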
Social Network Benchmark Interactive Workload (Ioan Toma)
The document summarizes the LDBC Social Network Benchmark (SNB) for interactive workloads on social network systems. It describes:
- The SNB data generator that produces synthetic social network data with characteristics similar to Facebook, including persons, groups, posts, and relationships. It scales to very large datasets.
- The types of queries in the benchmark - complex reads, short reads, and updates - that mimic real user interactions. Complex reads target specific system bottlenecks.
- The workload driver that schedules queries and updates over time based on the synthetic data, aiming to balance read and write loads and mimic user browsing behavior. It measures latency and throughput.
- Validation and execution rules to ensure fair comparisons.
This document outlines the process for auditing an implementation of the LDBC Social Network Benchmark (SNB). It describes downloading the necessary driver and data generator software. The main metrics for benchmarking are throughput at scale and query latency constraints during crash recovery testing. It provides guidelines for implementing the driver, loading data, running queries for 2 hours, and verifying crash recovery. Audit results will report throughput, load time, recovery time and system configuration. Rules prohibit hinted query plans, precomputed indexes, and require serializable consistency.
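The throughput and latency reporting described above can be sketched with a minimal single-threaded driver. The dummy query function and the workload size are placeholders; a real SNB audit runs the official driver against the system under test for the full two-hour window.

```python
import time

def run_workload(query_fn, queries):
    """Execute queries sequentially and report overall throughput
    together with latency percentiles."""
    latencies = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        query_fn(q)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()

    def pct(p):  # simple nearest-rank percentile over sorted latencies
        return latencies[min(len(latencies) - 1, int(p / 100 * len(latencies)))]

    return {
        "throughput_qps": len(queries) / elapsed,
        "p50_s": pct(50),
        "p99_s": pct(99),
    }

# Placeholder query standing in for SNB complex/short reads and updates.
stats = run_workload(lambda q: sum(range(1000)), range(100))
```

The official driver additionally schedules operations at data-derived timestamps and enforces latency constraints, which a sequential sketch like this does not attempt.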
The document discusses various benchmarks that are commonly used to evaluate Semantic Web repositories and their performance handling large amounts of RDF data. Some of the major benchmarks mentioned include the Lehigh University Benchmark (LUBM), Berlin SPARQL Benchmark (BSBM), SP2Bench, Social Network Intelligence Benchmark (SIB), and DBPedia SPARQL Benchmark. The document also provides an overview of different benchmark components and links to resources with performance results from various RDF stores and systems.
On-Demand RDF Graph Databases in the Cloud (Marin Dimitrov)
slides from the S4 webinar "On-Demand RDF Graph Databases in the Cloud"
RDF database-as-a-service running on the Self-Service Semantic Suite (S4) platform: http://s4.ontotext.com
video recording of the talk is available at http://info.ontotext.com/on-demand-rdf-graph-database
The Characteristics of a RESTful Semantic Web and Why They Are Important (Chimezie Ogbuji)
1. The document discusses characteristics of a RESTful Semantic Web and why they are important. It proposes that graph URIs in an RDF dataset identify the meaning of their associated graph.
2. It suggests a model where REST components can interact with named graphs intuitively via their URIs, retrieving and updating representations of the graphs' meanings.
3. It acknowledges some exceptions, like URIs that are names rather than locations, and proposes embedding such URIs in resolvable parent URIs to enable REST interaction.
Selecting the right database type for your knowledge management needs (Synaptica, LLC)
This presentation looks at relational vs. graph databases and their advantages and disadvantages in storing semantic data for taxonomies and ontologies.
LOD2 plenary meeting in Paris: presentation of WP2: State of Play (Storing and Querying Very Large Knowledge Bases) by Peter Boncz (CWI) and Orri Erling (OpenLink Software)
Practical approaches to entification in library bibliographic dataTerry Reese
This document discusses practical approaches to linking library bibliographic data to external identifiers. It describes BIBFRAME as the new model for linked data and challenges around reconciling legacy data and different model flavors. Tools for experimenting are discussed, including MarcEdit's linked data tools that allow embedding URIs directly in MARC records. The document focuses on MarcEdit's MARCNext toolset and its features for testing BIBFRAME transformations and viewing linked data relationships. It addresses challenges around the readiness of linking services and local implementation of identifiers. Source code for the BIBFRAME testing plugin is also provided.
RDF generator that produces Linked Data that bear similar characteristics with real datasets,
presented at the 1st International Workshop on Benchmarking Linked Data (BLINK).
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
overview of the RDF graph database-as-a-service (GraphDB based) on the Self-Service Semantic Suite (S4)
http://s4.ontotext.com
presentation for the AKSW Group of the University of Leipzig
First Steps in Semantic Data Modelling and Search & Analytics in the CloudOntotext
This webinar will break the roadblocks that prevent many from reaping the benefits of heavyweight Semantic Technology in small scale projects. We will show you how to build Semantic Search & Analytics proof of concepts by using managed services in the Cloud.
Data Processing over very Large Relational Databaseskvaderlipa
Final presentation of my dissertation thesis focused on orientation, analyzing and finding information in large or unknown relational databases and data visualisation
AALL 2015: Hands on Linked Data Tools for Catalogers: MarcEdit and MARCNextTerry Reese
MarcEdit and MARCNext provide linked data tools for catalogers to experiment with BIBFRAME and linked data concepts. The MARCNext tools allow visualization of metadata in various BIBFRAME formats and embedding URIs. The Link Identifiers tool embeds URIs from sources like VIAF, ID.LOC.GOV, and FAST in authority fields. The SPARQL Browser allows querying SPARQL endpoints. The Zepheira training plugin transforms records for the LibHub initiative. However, many linking services are still evolving and local implementations vary, posing challenges for widespread adoption.
This document discusses various data warehousing concepts. It begins by explaining that fact tables can share dimension tables and that typically multiple dimension tables are associated with a single fact table. It then defines ROLAP, MOLAP, and DOLAP architectures for OLAP and discusses how data is stored in each. An MDDB is described as a multidimensional database that stores data in multidimensional arrays, whereas an RDBMS stores data in tables and columns. The differences between OLTP and OLAP systems are outlined. Transformations in ETL are explained as manipulating data from its source form into a simplified form for the data warehouse. Filter transformations are briefly described. Finally, supported default source types for Informatica Power
This document discusses the key role of search in SharePoint 2013 and new search-related features. It provides an overview of the search architecture and components in SharePoint 2013. It also demonstrates query rules and the content search web part. Cross-site publishing is overviewed as another search-dependent feature. The document emphasizes the importance of high availability for search and provides a sample PowerShell script for cloning an active search topology.
This document discusses the development of a resource history service at APNIC that provides access to historical registry data through an RDAP API and user interface. It aims to reconnect disconnected history from registry changes and increase transparency. The service exposes previous registry states through a RDAP extension and has a prototype UI for exploring changing data over time. Feedback is sought on the API and UI as the service moves from experimental to stable.
Cloud-based Linked Data Management for Self-service Application DevelopmentPeter Haase
Peter Haase and Michael Schmidt of fluid Operations AG presented on developing applications using linked open data. They discussed the increasing amount of linked open data available and challenges in building applications that integrate data from different sources and domains. Their Information Workbench platform aims to address these challenges by allowing users to discover, integrate, and customize applications using linked data in a no-code environment. Key components of the platform include virtualized integration of data sources and the vision of accessing linked data as a cloud-based data service.
This was presented by the MongoDB team at the Singapore VIP event on 24th Jan 2019.
The presentation covers-
What is MongoDB
Why MongoDB
MongoDB As a Service, Serverless Platform and Mobile
MongoDB Atlas: Database as a Service (Available on AWS, Azure and Google Cloud)
Usecases
Linked data and the future of librariesRegan Harper
The document discusses a presentation given by OCLC and LYRASIS on linked data and what it means for the future of libraries. It provides an overview of linked data concepts, including defining linked data as using the web to connect related data and lower barriers to linking data. It outlines some of the key principles of linked data, and discusses how linked data can benefit libraries by making data more reusable, efficient to maintain and discoverable. It also notes some of the challenges libraries may face in changing workflows and maintaining information provenance with linked data.
MODAClouds Decision Support System for Cloud Service SelectionIoan Toma
The document discusses MODAClouds' Decision Support System (DSS) for cloud service selection. Some key points:
- The DSS helps users select cloud services by considering multiple dimensions like cost, quality, risks, and technical/business constraints.
- It allows multiple stakeholders like architects, operators, managers to provide input on tangible/intangible assets and risks.
- The DSS performs risk analysis and generates requirements. It also considers issues around multi-cloud environments like interoperability and migration challenges.
- Other features include automatic data gathering from various sources, and progressive learning over time from user inputs and service selections.
MarkLogic is an enterprise NoSQL database platform that can be used for semantic search, data integration, and intelligent recommendation engines. It natively stores XML, JSON, and RDF triples alongside documents. Triples provide context to documents and enable semantic queries over data. MarkLogic also supports full-text and geospatial search, transactions, and flexible indexing and replication of data at scale.
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015Ioan Toma
The document discusses the LDBC Social Network Benchmark for evaluating database and graph processing systems. It describes the benchmark's social network data generator which produces realistic data following power law distributions and correlations. It also outlines the benchmark's three workloads: interactive, business intelligence, and graph analytics. The focus is on the interactive workload, which includes complex read queries, simple read queries, and concurrent updates. It aims to identify choke points and measure the acceleration factor a system can sustain for the query mix while meeting a maximum query latency. Parameter curation is used to select query parameters that produce stable performance. The parallel query driver respects dependencies between queries to evaluate a system's ability to handle the workload concurrently.
Lighthouse: Large-scale graph pattern matching on GiraphIoan Toma
This document provides an overview of Lighthouse, a system for large-scale graph pattern matching on Giraph. It discusses the timeline of Lighthouse's development, its vertex-centric API, architecture based on BSP and Pregel, and execution algebra using operations like scan, select, project, and joins. Examples are provided of matching graph patterns in Cypher queries and finding the k-shortest paths between nodes while restricting to safe edges. The implementation uses two phases, first computing the routes for each top k path and then traversing back to build the full paths. Preliminary results are mentioned but not detailed.
HP Labs: Titan DB on LDBC SNB interactive by Tomer Sagi (HP)Ioan Toma
HP has a long history of innovation dating back to its founding in a Palo Alto garage in 1939. Some of its notable innovations include the first programmable calculator in 1968, the first pocket scientific calculator in 1972, launching the first inkjet printer in 1984, and being first to commercialize RISC technology in 1986. More recently, HP Labs has developed technologies like ePrint in 2010, 3D Photon technology in 2011, and Project Moonshot in 2013. Going forward, HP Labs is focusing its research on systems, networking, security, analytics, and printing to deliver the fastest and most efficient route from data to value.
FOSDEM2014 - Social Network Benchmark (SNB) Graph Generator - Peter BonczIoan Toma
This document describes the LDBC Social Network Benchmark (SNB). The SNB was created to benchmark graph databases and RDF stores using a social network scenario. It includes a data generator that produces synthetic social network data with correlations between attributes. It also defines workloads of interactive, business intelligence, and graph analytics queries. Systems are evaluated based on metrics like query response times. The SNB aims to uncover performance choke points for different database systems and drive improvements in graph processing capabilities.
5th LF Energy Power Grid Model Meet-up SlidesDanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Mircosoft Teams session or in person at TU/e located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid -Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Speck&Tech
ABSTRACT: A prima vista, un mattoncino Lego e la backdoor XZ potrebbero avere in comune il fatto di essere entrambi blocchi di costruzione, o dipendenze di progetti creativi e software. La realtà è che un mattoncino Lego e il caso della backdoor XZ hanno molto di più di tutto ciò in comune.
Partecipate alla presentazione per immergervi in una storia di interoperabilità, standard e formati aperti, per poi discutere del ruolo importante che i contributori hanno in una comunità open source sostenibile.
BIO: Sostenitrice del software libero e dei formati standard e aperti. È stata un membro attivo dei progetti Fedora e openSUSE e ha co-fondato l'Associazione LibreItalia dove è stata coinvolta in diversi eventi, migrazioni e formazione relativi a LibreOffice. In precedenza ha lavorato a migrazioni e corsi di formazione su LibreOffice per diverse amministrazioni pubbliche e privati. Da gennaio 2020 lavora in SUSE come Software Release Engineer per Uyuni e SUSE Manager e quando non segue la sua passione per i computer e per Geeko coltiva la sua curiosità per l'astronomia (da cui deriva il suo nickname deneb_alpha).
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
OpenID AuthZEN Interop Read Out - AuthorizationDavid Brossard
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Outline
• Motivation
• The New Reference Datasets
• Changes in the generated metadata
• Ontologies and Required Rule-sets
• The Concrete Inference patterns
• Changes in the Basic Interactive Query Mix
• GraphDB Configuration Disclosure
• Future plans

LDBC Semantic Publishing Benchmark Evolution, Mar 2015
Change is Needed
SPB needed to evolve in order to:
• Allow for retrieval of semantically relevant content
  – Based on rich metadata descriptions and diverse reference knowledge
• Demonstrate that triplestores offer simplified and more efficient querying
And this way to:
• Cover more advanced and pertinent usage of triplestores
• Present a bigger challenge to the engines
Directions for Improvement
• Background and issues with SPB 1.0:
  – “Reference data” = master data in Dynamic Semantic Publishing
    • Essentially, taxonomies and entity datasets used to describe “creative works”
  – SPB 1.0 has a very small set of reference data
  – Reference data is not interconnected – few relations between entities
  – All descriptions of content assets (articles, pictures, etc.) refer to the Reference Data, but not to one another
  – Star-shaped graph with a small reference dataset in the middle
    • This way SPB does not use the full potential of graph databases
• Bigger and better connected Reference Data
  – More reference data and more connections between the entities
• A better connected dataset
  – Make cross-references between the different pieces of content
Directions for Improvement (2)
• Make use of inference
  – In SPB 1.0 it is really trivial: a couple of subclasses and sub-properties
  – It would be relevant to have some transitive, symmetric and inverse properties
  – Such semantics can foster retrieval of relevant content
• More interesting and challenging query mixes
  – This can go in plenty of different directions, but the most obvious one is to evolve the queries so that they benefit from the changes above
  – In the bigger picture, there should be a way to make some of the queries nicer, i.e. simpler and cleaner
• Testing high availability
  – The FT acceptance tests for an HA cluster are a great starting point
  – Too ambitious for SPB 2.0
Extended Reference Dataset
• 22M explicit statements in total in SPB v2.0
  – vs. 127k statements in SPB 1.0: BBC lists + tiny fractions of GeoNames
• Added reference data from DBPedia 2014
  – Extracts like: DESCRIBE ?e WHERE { ?e a dbp-ont:Company }
  – Companies (85,000 entities)
  – Events (50,000 entities)
  – Persons (1M entities)
• Geonames data for Europe
  – All European countries, w/o RUS, UKR, BLR; 650,000 locations in total
  – Some properties and location types irrelevant to SPB excluded
• owl:sameAs links between DBpedia and Geonames
  – 500k owl:sameAs mappings
Interconnected Reference Dataset
• Substantial volume of connections between entities
• Geonames comes with a hierarchical relationship, gn:parentFeature, defining the nesting of locations
• DBPedia inter-entity relationships*:

            To Company   To Person   To Place / gn:Feature   To Event
Company         40,797      26,675                 218,636         18
Person          89,506   1,324,425               3,380,145    145,892
Event            5,114     154,207                 140,579     35,442

* The numbers in the table are approximate.
RDF-Rank for the Reference Data
• GraphDB’s RDFRank feature is used to calculate a rank for all the entities in the Reference Data
  – RDFRank calculates a measure of importance for all URIs in an RDF graph, based on Google’s PageRank algorithm
• These RDFRanks are calculated with GraphDB after the entire Reference Data is loaded
  – This way all kinds of relationships in the graph are considered during the calculation, including those that were logically inferred
• RDFRanks are inserted using the predicate <http://www.ldbcouncil.org/spb#hasRDFRank>
• The ranks are shipped as part of the Reference Data
  – No need to compute them again during loading
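GraphDB's actual RDFRank implementation is not shown in the slides; as a rough intuition, a PageRank-style rank over the entity graph can be sketched as below. The toy graph, entity names and damping factor are illustrative assumptions, not benchmark data.

```python
# Minimal PageRank-style ranking over a toy entity graph.
# Illustrative sketch only, not GraphDB's RDFRank implementation;
# the entity URIs and damping factor are made up.

def pagerank(edges, damping=0.85, iterations=50):
    """edges maps each node to the list of nodes it links to."""
    nodes = set(edges) | {t for targets in edges.values() for t in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for source, targets in edges.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for t in targets:
                    new_rank[t] += share
        # nodes with no outgoing edges spread their rank evenly
        dangling = damping * sum(rank[v] for v in nodes if not edges.get(v))
        for v in nodes:
            new_rank[v] += dangling / n
        rank = new_rank
    return rank

edges = {
    "dbp:BBC": ["gn:London"],
    "dbp:Vodafone": ["gn:London", "dbp:BBC"],
    "gn:London": ["gn:England"],
    "gn:England": [],
}
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))  # the entity that inherits the most rank
```

Rank flows along edges, so an entity that is the target of a well-linked entity can end up ranked higher than one with many direct references — which is why computing ranks after all (including inferred) relationships are loaded matters.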
RDF-Rank Used for Popular Entities
• The Data Generator uses a set of “popular entities”
  – These are referred to in 30% of the content-to-entity relations/tags
  – This is one of the heuristics the data generator uses to produce more realistic data distributions
  – It was already implemented in SPB 1.0; no change in SPB 2.0
• Popular entities are those in the top 5% by RDF Rank
  – This is the new thing in SPB v2.0
  – Before that, the popular entities were selected randomly
  – This way we get a more realistic dataset, in which the entities most often used to tag content also have better connectivity in the Reference Data
• In the future, RDF-Ranks can be used for other purposes as well
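The two heuristics above (top 5% by rank, 30% of tags) can be sketched as follows. The ranks, entity names and counts are invented for illustration; the real generator's sampling logic is more involved.

```python
# Sketch of the "popular entities" heuristic: take the top 5% of
# entities by rank, then draw 30% of content-to-entity tags from
# that popular set. All ranks and names are hypothetical.
import random

def pick_popular(ranks, fraction=0.05):
    """Return the top `fraction` of entities, ordered by rank."""
    ordered = sorted(ranks, key=ranks.get, reverse=True)
    cutoff = max(1, int(len(ordered) * fraction))
    return ordered[:cutoff]

def draw_tag(ranks, popular, popular_share=0.30, rng=random):
    """Draw one entity reference: 30% from popular, 70% uniform."""
    if rng.random() < popular_share:
        return rng.choice(popular)
    return rng.choice(list(ranks))

ranks = {f"entity-{i}": i for i in range(100)}  # toy ranks
popular = pick_popular(ranks)                   # 5 top-ranked entities
popular_set = set(popular)
tags = [draw_tag(ranks, popular) for _ in range(10_000)]
popular_hits = sum(t in popular_set for t in tags)
print(len(popular), round(popular_hits / len(tags), 2))
```

Note that the uniform branch can also hit a popular entity, so the observed share of popular tags lands slightly above 30%.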
Changes in the Metadata
• Three times more relationships are generated between Creative Works and Entities than in SPB 1.0
  – More recent use cases in publishing adopt rich metadata descriptions with more than 10 references to relevant entities and concepts
  – In SPB 1.0 there are on average slightly fewer than 3 entity references per creative work, based on distributions from an old BBC archive
  – While developing SPB 2.0 we did not have access to a substantial amount of rich semantic metadata from a publisher’s archive that would let us derive content-to-concept frequency distributions
  – So we decided to use the same distributions from the BBC, but to triple the concept references for each of them
  – In SPB 2.0 there are on average about 8 content-to-concept references
  – This results in about 25 statements per creative work description
  – All other heuristics and specifics of the metadata generator were preserved (popular entities, CW sequences resembling storylines, etc.)
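The tripling step can be sketched in a few lines: draw the per-creative-work reference count from a base distribution with a mean just under 3, then multiply by 3. The base distribution below is a made-up stand-in for the BBC-derived one, which is not published in the slides.

```python
# Sketch of the tripled content-to-concept distribution: draw a
# per-creative-work reference count from a hypothetical BBC-like
# base distribution (mean 2.75), then triple it as SPB 2.0 does.
import random

base_distribution = [1, 2, 2, 3, 3, 3, 4, 4]  # hypothetical; mean 2.75

def references_per_work(rng=random, factor=3):
    return factor * rng.choice(base_distribution)

counts = [references_per_work() for _ in range(100_000)]
print(round(sum(counts) / len(counts), 1))  # mean ≈ 3 × 2.75 = 8.25
```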
Changes in the Metadata (2)
• Geonames URIs are used for content-to-location references
  – As in SPB v1.0, each creative work description refers (through the cwork:Mention property) to exactly one location
    • These references are exploited in queries with geo-spatial constraints
  – While most of the geo-spatial information in the Reference Data comes from Geonames, about 500 thousand locations also have DBPedia URIs, coming from the DBPedia-to-Geonames owl:sameAs mappings
  – Intentionally, DBPedia URIs are not used when the metadata about creative works is generated
    • In contrast, the substitution parameters for locations in the corresponding queries use only DBpedia locations
    • This way, the corresponding queries return no results unless proper sameAs expansion is in place
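The effect described above can be demonstrated with a tiny union-find sketch: the data tags content with a Geonames URI, the query parameter is a DBpedia URI, and only a lookup expanded over the owl:sameAs equivalence class finds the match. All URIs below are hypothetical.

```python
# Sketch of owl:sameAs expansion: a naive URI comparison fails,
# an equality check up to sameAs classes succeeds. Illustrative
# only; triplestores implement this very differently.

def same_as_classes(pairs):
    """Build equivalence classes from owl:sameAs pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)
    return find

same_as = [("dbp:London", "gn:2643743")]
tagged = {"article-1": "gn:2643743"}  # content tagged with a Geonames URI

find = same_as_classes(same_as)

def mentions(article, location):
    # match up to sameAs: same equivalence-class representative
    return find(tagged[article]) == find(location)

print(tagged["article-1"] == "dbp:London")  # naive match fails
print(mentions("article-1", "dbp:London"))  # sameAs-expanded match succeeds
```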
Changes in the Metadata – Stats
• As a result of the changes in the size of the Reference Data and of the metadata, the number of Creative Works included at the same scale factor differs between SPB v1.0 and SPB v2.0:

                                                 SPB 1.0   SPB 2.0
Reference Data size (explicit statements)           170k       22M
Creative Work descr. size (explicit st./CW)           19        25
Metadata in 50M dataset (explicit statements)        50M       28M
Creative Works in 50M dataset (count)               2.6M      1.1M
Creative Work-to-Entity relationships in 50M          7M        9M
Metadata in 1B dataset (explicit statements)          1B      978M
Creative Works in 1B dataset (count)                 53M       39M
Creative Work-to-Entity relationships in 1B         137M      313M
15. • Motivation
• The New Reference Datasets
• Changes in the generated metadata
• Ontologies and Required Rule-sets
• The Concrete Inference patterns
• Changes in the Basic Interactive Query Mix
• GraphDB Configuration Disclosure
• Future plans
Outline
16. The Rule-sets
• Complete inference in SPB 2.0 requires support for
the following primitives:
– RDFS: subPropertyOf, subClassOf
– OWL: TransitiveProperty, SymmetricProperty, sameAs
• Any triplestore with OWL 2 RL reasoning support will
be able to process SPB v2.0 correctly
– In fact, a much-reduced rule-set is sufficient for complete reasoning,
as OWL 2 RL contains a host of primitives and rules that SPB 2.0 does
not make use of. Examples are all onProperty restrictions, class
and property equivalence, property chains, etc.
– The popular (but not W3C standardized) OWL Horst profile is sufficient
for reasoning with SPB v 2.0
• OWL Horst refers to the pD* entailment defined by Herman ter Horst in:
Completeness, decidability and complexity of entailment for RDF Schema and a
semantic extension involving the OWL vocabulary. J. Web Sem. 3(2-3): 79-115 (2005)
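The four primitives listed above need very little machinery. As a hedged illustration (not any engine's actual implementation), the sketch below materializes a triple set to a fixpoint using exactly those rules; the plain-string predicate constants, the `materialize` helper, and the `ex:` toy data are assumptions made for this example. owl:sameAs is deliberately left out here, since its standard semantics (substituting equivalent URIs in any position of a statement) goes beyond a rule of this simple shape.

```python
# Minimal sketch (not a real triplestore): naive forward-chaining to a
# fixpoint over the rule types SPB 2.0 needs. Constants are plain strings
# standing in for the RDFS/OWL URIs.
SUBCLASS, SUBPROP, TYPE = "rdfs:subClassOf", "rdfs:subPropertyOf", "rdf:type"
TRANSITIVE, SYMMETRIC = "owl:TransitiveProperty", "owl:SymmetricProperty"

def materialize(triples):
    """Return the closure of `triples` under the four inference rules."""
    closed = set(triples)
    while True:
        transitive = {s for s, p, o in closed if p == TYPE and o == TRANSITIVE}
        symmetric = {s for s, p, o in closed if p == TYPE and o == SYMMETRIC}
        superprop = {(sub, sup) for sub, p, sup in closed if p == SUBPROP}
        superclass = {(sub, sup) for sub, p, sup in closed if p == SUBCLASS}
        new = set()
        for s, p, o in closed:
            # rdfs:subPropertyOf: lift each statement to its super-property
            new.update((s, sup, o) for sub, sup in superprop if p == sub)
            # rdfs:subClassOf: lift rdf:type statements to the superclass
            if p == TYPE:
                new.update((s, TYPE, sup) for sub, sup in superclass if o == sub)
            # owl:SymmetricProperty: mirror the statement
            if p in symmetric:
                new.add((o, p, s))
            # owl:TransitiveProperty: chain statements with the same property
            if p in transitive:
                new.update((s, p, o2) for s2, p2, o2 in closed
                           if p2 == p and s2 == o)
        if new <= closed:
            return closed
        closed |= new

# Toy data mirroring the slides: a transitive geographic property and the
# cwork:about -> cwork:tag sub-property (entity URIs are made up).
kb = {
    ("geo-ont:parentFeature", TYPE, TRANSITIVE),
    ("cwork:about", SUBPROP, "cwork:tag"),
    ("ex:Munich", "geo-ont:parentFeature", "ex:Bavaria"),
    ("ex:Bavaria", "geo-ont:parentFeature", "ex:Germany"),
    ("ex:cw1", "cwork:about", "ex:Munich"),
}
closure = materialize(kb)
```

Running the sketch on the toy data infers both the transitive parent-feature link (Munich to Germany) and the lifted cwork:tag statement.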
17. • Modified versions of the BBC Ontologies
– As in SPB v.1.0, BBC Core ontology defines relationships between
Person – Organizations – Locations - Events
– Those were mapped to corresponding DBPedia and Geonames classes
and relationships
• bbccore:Thing is defined as super class of dbp-ont:Company, dbp-ont:Event,
foaf:Person, geonames-ontology:Feature (geographic feature)
– dbp-ont:parentCompany is defined to be owl:TransitiveProperty
– These extra definitions are added at the end of the BBC Core ontology
• Added a SPB-modified version of the Geonames
ontology
– Stripped out unnecessary pieces, such as onProperty restrictions that
impose cardinality constraints
New and Updated Ontologies
18. • Motivation
• The New Reference Datasets
• Changes in the generated metadata
• Ontologies and Required Rule-sets
• The Concrete Inference patterns
• Changes in the Basic Interactive Query Mix
• GraphDB Configuration Disclosure
• Future plans
Outline
19. • Transitive closure of location-nesting relationships
– geo-ont:parentFeature property is used in GeoNames to link each
location to larger locations that it is a part of
– geo-ont:parentFeature is owl:TransitiveProperty in the GN ontology
– This way, if Munich has gn:parentFeature Bavaria, which in turn has
gn:parentFeature Germany, then reasoning should ensure that
Germany also appears as gn:parentFeature of Munich
• Transitive closure of dbp-ont:parentCompany
– Inference unveils indirect company control relationships
• owl:sameAs relations between Geonames features
and DBPedia URIs for the same locations:
– The standard semantics of owl:sameAs requires that each statement
asserted using the Geonames URIs should be inferable/retrievable also
with the mapped DBPedia URIs
Concrete Inference Patterns
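The owl:sameAs requirement above can be pictured as a brute-force expansion: every statement is replicated over all equivalent URIs, so a fact asserted with a Geonames URI becomes retrievable with the mapped DBpedia URI. The sketch below is an illustrative assumption (the `same_as_expand` helper and the `ex:` URIs are made up for this example), not the benchmark's tooling.

```python
# Minimal sketch of brute-force owl:sameAs expansion (illustrative URIs;
# not any engine's actual implementation).
from itertools import product

def same_as_expand(triples, same_as_pairs):
    """Replicate every statement over all owl:sameAs-equivalent URIs."""
    # Union-find over the sameAs pairs yields the equivalence classes
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in same_as_pairs:
        parent[find(a)] = find(b)
    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    def equivalents(x):
        return classes[find(x)] if x in parent else {x}
    # Emit one copy of each triple per equivalent subject/object pair
    return {(s2, p, o2)
            for s, p, o in triples
            for s2, o2 in product(equivalents(s), equivalents(o))}

# A Geonames-asserted fact becomes retrievable through the DBpedia URIs,
# which is what the queries with DBpedia substitution parameters rely on.
asserted = {("ex:gnSofia", "geo-ont:parentFeature", "ex:gnBulgaria")}
mappings = [("ex:gnSofia", "ex:dbSofia"), ("ex:gnBulgaria", "ex:dbBulgaria")]
expanded = same_as_expand(asserted, mappings)
```

With one asserted triple and two sameAs pairs, the expansion produces all four subject/object combinations, including the pure-DBpedia one that the queries match against.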
20. • The inference patterns from SPB v.1.0 are still there:
– cwork:tag statements inferred from each of its sub-properties
cwork:about and cwork:mentions
– < ?cw rdf:type cwork:CreativeWork > statements inferred for
instances of each of its sub-classes
– These two are the only reasoning patterns that apply to the generated
content metadata; all the others apply only to the Reference Data
Concrete Inference Patterns (2)
21. • If brute-force materialization is used, the Reference
Data expands from 22M statements to about 100M
– If owl:sameAs expansion is not considered, materialization on top of
the Reference Data adds about 7M statements – most of those coming
from the transitive closure of geo-ont:parentFeature
– Several triplestores (e.g. Oracle and GraphDB) that use forward-
chaining have a specific mechanism that allows them to handle
owl:sameAs reasoning in a hybrid manner, without expanding
their indices. After materialization, such engines have to deal with
29M statements of Reference Data, instead of 100M
• Brute-force materialization of the generated
metadata describing creative works would double it
– In SPB v.1.0 the expansion factor was 1.6, but now there are more
entity references and also owl:sameAs equivalents of the location URIs
Inference Statistics and Ratios
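The hybrid owl:sameAs handling mentioned above can be sketched with a toy store that indexes a single canonical URI per equivalence class and canonicalizes URIs at insert and query time, which is why the index stays near 29M statements instead of 100M. This class is an illustrative assumption only, not GraphDB's or Oracle's actual mechanism, and it assumes simple pairwise mappings such as the DBpedia-to-Geonames ones.

```python
# Toy sketch of hybrid owl:sameAs handling: keep one canonical copy of each
# statement in the index and translate URIs on the way in and out.
# (Illustrative only; a full engine would merge equivalence classes with a
# proper union-find rather than this simple chained map.)

def canonical_map(same_as_pairs):
    canon = {}
    for a, b in same_as_pairs:
        root = canon.get(a, canon.get(b, a))  # reuse a known representative
        canon[a] = canon[b] = root
    return canon

class HybridStore:
    def __init__(self, same_as_pairs):
        self.canon = canonical_map(same_as_pairs)
        self.triples = set()

    def _c(self, uri):
        return self.canon.get(uri, uri)

    def add(self, s, p, o):
        # only one copy per equivalence class ever reaches the index
        self.triples.add((self._c(s), p, self._c(o)))

    def ask(self, s, p, o):
        # queries are canonicalized too, so any equivalent URI matches
        return (self._c(s), p, self._c(o)) in self.triples

store = HybridStore([("ex:gnSofia", "ex:dbSofia")])
store.add("ex:gnSofia", "geo-ont:parentFeature", "ex:gnBulgaria")
store.add("ex:dbSofia", "geo-ont:parentFeature", "ex:gnBulgaria")  # duplicate
```

Both insertions collapse into a single indexed statement, yet the fact remains retrievable through either URI of the sameAs pair.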
22. • Motivation
• The New Reference Datasets
• Changes in the generated metadata
• Ontologies and Required Rule-sets
• The Concrete Inference patterns
• Changes in the Basic Interactive Query Mix
• GraphDB Configuration Disclosure
• Future plans
Outline
23. Modified Queries in the Basic Interactive Mix
• SPB 2.0 changes only the Basic Interactive Query-Mix
– The other query mixes are mostly unchanged
• In most of the queries, the cwork:about and
cwork:mentions properties that link a creative work to
an entity have been replaced with their super-property
cwork:tag
– This applies also to the Advanced Interactive and the Analytical use
cases
– For most of the queries, both types of relations are equally relevant
– Using the super-property makes the query simpler, compared to
alternative formulations that address both properties (e.g. via UNION)
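The simplification above can be sketched in a few lines: once the rdfs:subPropertyOf statements are materialized, a single lookup on cwork:tag returns exactly what a UNION over cwork:about and cwork:mentions would. The helper names and `ex:` data below are made up for this toy illustration.

```python
# Toy illustration: one lookup on the materialized super-property cwork:tag
# yields the same answers as a UNION over its sub-properties.

def lift_subproperties(triples, sub_to_super):
    """Materialize one level of rdfs:subPropertyOf inference."""
    return set(triples) | {(s, sub_to_super[p], o)
                           for s, p, o in triples if p in sub_to_super}

def tagged_with(triples, prop, entity):
    """Subjects related to `entity` through `prop` (single triple pattern)."""
    return {s for s, p, o in triples if p == prop and o == entity}

def tagged_with_union(triples, props, entity):
    """Same result, but phrased as a UNION over several properties."""
    return {s for s, p, o in triples if p in props and o == entity}

hierarchy = {"cwork:about": "cwork:tag", "cwork:mentions": "cwork:tag"}
data = {
    ("ex:cw1", "cwork:about", "ex:Sofia"),
    ("ex:cw2", "cwork:mentions", "ex:Sofia"),
    ("ex:cw3", "cwork:about", "ex:Plovdiv"),
}
materialized = lift_subproperties(data, hierarchy)
```

The single-pattern lookup over the materialized data and the two-branch UNION over the raw data agree, which is the point of querying through the super-property.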
24. New Queries
• Basic Interactive query-mix now adds 2 new queries
exploring the interconnectedness in reference data
• Q10: Retrieve CWs that mention locations in the
same province (A.ADM1) as the specified one
– There is an additional constraint on the time interval (5 days)
• Q11: Retrieve the most recent CWs that are tagged
with entities related to a specific popular entity
– Relations can be inbound and outbound; explicit or inferred
25. Q10: News from the region
SELECT ?cw ?title ?dateModified {
<http://dbpedia.org/resource/Sofia> geo-ont:parentFeature ?province .
?province geo-ont:featureCode geo-ont:A.ADM1 .
{
?location geo-ont:parentFeature ?province .
} UNION {
BIND(?province as ?location) .
}
?cw a cwork:CreativeWork ;
cwork:tag ?location ;
cwork:title ?title ;
cwork:dateModified ?dateModified .
FILTER(?dateModified >=
"2011-05-14T00:00:00.000"^^<http://www.w3.org/2001/XMLSchema#dateTime>
&& ?dateModified <
"2011-05-19T23:59:59.999"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
}
LIMIT 100
26. Q11: News about related entities
SELECT DISTINCT ?cw ?title ?description ?dateModified ?primaryContent
{
{
<http://dbpedia.org/resource/Teresa_Fedor> ?p ?e .
}
UNION
{
?e ?p <http://dbpedia.org/resource/Teresa_Fedor> .
}
?e a core:Thing .
?cw cwork:tag ?e ;
cwork:title ?title ;
cwork:description ?description ;
cwork:dateModified ?dateModified ;
bbc:primaryContentOf ?primaryContent .
}
ORDER BY DESC(?dateModified)
LIMIT 100
27. Trivial Validation Opportunities
• The validation mechanisms from SPB v.1.0 remain
unchanged
• In addition, the changes to the benchmark are
designed in such a way that, if specific inference
patterns are not supported by the engine, some of the
queries return zero results
– If owl:sameAs inference is not supported, Q10 will return 0 results
– If transitive properties are not supported, Q10 will return 0 results
• Q11 will return a smaller number of results (and a higher ratio of empty results)
– If rdfs:subPropertyOf is not supported, Q2-Q10 will return 0 results
– If rdfs:subClassOf is not supported, Q1, Q2, Q6, Q8 will return no results
28. • Motivation
• The New Reference Datasets
• Changes in the generated metadata
• Ontologies and Required Rule-sets
• The Concrete Inference patterns
• The New Queries
• GraphDB Configuration Disclosure
• Future plans
Outline
29. 1. Reasoning performed through forward-chaining
with a custom ruleset
– This is the rdfs-optimized ruleset of GraphDB with added rules to
support transitive, inverse and symmetric properties
2. owl:sameAs optimization of GraphDB is beneficial
– This is the standard behavior of GraphDB; but it helps a lot
– Query-time tuning is used to disable expansion of results with
respect to owl:sameAs equivalent URIs
3. Geo-spatial index in Q6 of the basic interactive mix
4. Lucene Connector for full-text search in Q8
Note: Points #2-#4 above help query time but slow
down updates in GraphDB, as materialization does, too
GraphDB Configuration
30. • Experimental run using GraphDB-SE 6.1
• Using LDBC Scale Factor 1 - (22M reference data +
30M generated data)
• Hardware: 32G RAM, Intel i7-4770 CPU @ 3.40GHz,
single SSD Drive Samsung 845 DC
• Benchmark configuration: 8 reading / 2 writing
agents, 1 min warmup, 40 min benchmark
• Updates/sec: 13, Selects/sec: 44
GraphDB – First Results
31. • Motivation
• The New Reference Datasets
• Changes in the generated metadata
• Ontologies and Required Rule-sets
• The Concrete Inference patterns
• Changes in the Basic Interactive Query Mix
• GraphDB Configuration Disclosure
• Future plans
Outline
32. • Relationships between pieces of content
– At present Creative Works are related only to Entities, not to other
CWs
– These could be StoryLines or Collections of MM assets, e.g. an article
with a few images related to it
• Better modelling of content-to-entity cardinalities
– Based on data from FT
• Enriched query sets and realistic query frequencies
– Based on query logs from FT
• Loading/updating big dataset in live instances
• Testing high-availability
– FT acceptance tests for High Availability cluster are great starting point
Future Plans
33. • Much bigger reference dataset: from 170k to 22M
– 7M statements from GeoNames about Europe
– 14M statements from DBpedia for Companies, Persons, Events
• Interconnected reference data: more than 5M links
– owl:sameAs links between DBPedia and Geonames
• Much more comprehensive usage of inference
– While still the simplest possible flavor of OWL is used
• rdfs:subClassOf, rdfs:subPropertyOf, owl:TransitiveProperty, owl:sameAs
– Transitive closure over company control and geographic nesting
– Simpler queries through usage of super-property
• Retrieval of relevant content through links in the
reference data
Summary of SPB v.2.0 Changes