Enabling transparent SQL/SPARQL access to both static and dynamically-computed data
Query languages for databases (e.g., SQL) and knowledge graphs (e.g., SPARQL) provide a concise, declarative, and highly flexible mechanism to access stored data. Yet, many use cases also involve dynamically-computed data available through web APIs or other forms of external services. In such settings, data access is comparatively less flexible (e.g., due to restrictions on available input/output methods), less convenient, and sometimes prohibitively slow for users interactively querying data. In this talk, we discuss these problems and present open-source solutions that enable querying dynamically-computed data as a “virtual” (since not fully materialized) relational database via SQL, or as a “virtual” knowledge graph via SPARQL, while also providing pre-computation and caching solutions to speed up data access. The core components presented in the talk have been developed in the context of the HIVE “Fusion Grant” project and the OntoCRM project, both involving UNIBZ and Ontopic srl. In both projects, we aim at extending virtual knowledge graphs to dynamically-computed data, with a particular focus on applications in the domains of environmental sustainability and climate risk management.
SFScon22 - Francesco Corcoglioniti - Integrating Dynamically-Computed Data and Web APIs into Virtual Databases Knowledge Graph.pdf
1. Integrating Dynamically-Computed Data and Web APIs into “Virtual” Databases and Knowledge Graphs
Enabling transparent SQL/SPARQL access to both static and dynamically-computed data
Francesco Corcoglioniti
2022-11-11
postdoc @ KRDB, Free University of Bolzano,
supported by HIVE Fusion Grant project (2021-2022), OntoCRM project (2022-2024), and Ontopic s.r.l
2. Background
Data is increasingly available via Web APIs
• access to 3rd-party and/or dynamically-computed data
• access to data-related services, e.g., text search
Some API statistics (see footnote a)
• 83% of all Internet traffic belongs to API-based services
• 2M+ API repositories on GitHub
• 90% of developers use APIs
• 30% of development time spent on coding APIs
Complex data access problem for applications operating on data from both databases and APIs
a: https://nordicapis.com/20-impressive-api-economy-statistics/
[Figure: an application accesses RDB sources via SQL and API sources via calls, which results in a complex data access problem]
Integrating Dynamically-Computed Data and Web APIs into “Virtual” Databases and Knowledge Graphs 1/16
3. Simplify API Access via “Virtual” Databases (VDBs) or “Virtual” Knowledge Graphs (VKGs)
[Figure, left: an application sends SQL to a Virtual Database (VDB), which accesses RDB sources via SQL and API sources via calls]
[Figure, right: an application sends SPARQL to a Virtual Knowledge Graph (VKG), which accesses RDB sources via SQL and API sources via calls]
• unified data access: applications operate on a single DB/KG data source via a declarative data manipulation language (DML)
• virtual DB/KG: its data is (mostly) kept in the original sources (no ETL)
• data federation setting: VDB/VKG queries run by orchestrating source sub-queries and API calls
4. Example Scenario – Extend Open Data Hub (ODH) with Semantic Search
Answer hybrid queries like:
• get (plot) IRI, description, rating & location of accommodations ...
• whose rating is 3 stars or more (structured constraint) and ...
• whose EN description matches the search string “horse riding” (text constraint)
Semantic search: improved text search that aims at capturing and leveraging text meaning (vs term matching only)
• e.g., via BERT-based model from Sentence Transformers library
5. VDB Specification – SQL/MED
SQL/MED allows federating multiple sources in a virtual database (VDB)
• standardized SQL extension supported by some data federation systems like Teiid
• VDB as a set of schemas mapped to foreign data sources accessed via wrappers/translators
• we extend Teiid with a new service translator for accessing APIs
Example using Teiid with our extensions:
CREATE DATABASE vdb_example OPTIONS ( "... connection options for federated sources ..." );
USE DATABASE vdb_example;
CREATE SERVER db_source FOREIGN DATA WRAPPER postgresql; -- define RDB source with schema 'db'
CREATE SCHEMA db SERVER db_source; -- using 'postgresql' translator to access it
CREATE SERVER srv_source FOREIGN DATA WRAPPER service; -- define API source with schema 'srv'
CREATE SCHEMA srv SERVER srv_source; -- using 'service' translator to access it
IMPORT FOREIGN SCHEMA public FROM SERVER db_source INTO db OPTIONS ( importer.catalog 'public' );
SET SCHEMA srv;
-- CREATE FOREIGN TABLE / PROCEDURE statements mapped to API operations (API bindings)
6. VDB Specification – API Bindings
API operations as SQL/MED procedures
• input tuple → 0..n output tuples
• URL, method, request/response templates
CREATE FOREIGN PROCEDURE api_semsearch_query (
query VARCHAR
) RETURNS TABLE (
query VARCHAR,
id VARCHAR,
score DOUBLE,
excerpt VARCHAR
) OPTIONS (
"method" 'post',
"url" 'http://semsearch:8080/query',
"requestBody" '{"query": "{query}", "n": 100}',
"responseBody" '{"matches": [{
"id": "{id}",
"score": "{score}",
"excerpt": "{excerpt}" }] }'
);
API data as SQL/MED virtual tables
• linked to API operations/procedures
• each procedure defines an access pattern
CREATE FOREIGN TABLE vt_semsearch_match (
query VARCHAR NOT NULL,
id VARCHAR NOT NULL,
score DOUBLE NOT NULL,
excerpt VARCHAR NOT NULL,
PRIMARY KEY (query, id)
) OPTIONS ( "select" 'api_semsearch_query' );
CREATE FOREIGN TABLE vt_semsearch_index (
id VARCHAR PRIMARY KEY,
text VARCHAR NOT NULL
) OPTIONS (
"UPDATABLE" 'true',
"upsert" 'api_semsearch_store',
"delete" 'api_semsearch_clear'
);
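The request/response templates in the binding above can be read as a small pattern language. The following is a minimal Python sketch of how such templates could be interpreted; it is not the service translator's actual implementation. The helper names `fill_request` and `extract_rows` and the canned response are assumptions made for illustration only.

```python
import json

# Templates copied from the api_semsearch_query binding above.
REQUEST_TEMPLATE = '{"query": "{query}", "n": 100}'
RESPONSE_TEMPLATE = {
    "matches": [{"id": "{id}", "score": "{score}", "excerpt": "{excerpt}"}]
}


def fill_request(template: str, inputs: dict) -> dict:
    """Substitute {name} placeholders with JSON-escaped input values."""
    body = template
    for name, value in inputs.items():
        # json.dumps escapes quotes/backslashes; [1:-1] drops the outer quotes
        body = body.replace("{" + name + "}", json.dumps(str(value))[1:-1])
    return json.loads(body)


def extract_rows(template, data, row=None):
    """Match a response against the template; return one dict per output tuple."""
    row = dict(row or {})
    if isinstance(template, str) and template.startswith("{") and template.endswith("}"):
        row[template[1:-1]] = data  # bind the placeholder name to the value
        return [row]
    if isinstance(template, dict):
        rows = [row]
        for key, sub in template.items():
            rows = [r2 for r in rows
                    for r2 in extract_rows(sub, (data or {}).get(key), r)]
        return rows
    if isinstance(template, list):
        # a one-element list template matches every element of the response list
        return [r for item in (data or [])
                for r in extract_rows(template[0], item, row)]
    return [row]


# Hypothetical API response, invented for this example.
response = {"matches": [
    {"id": "A1", "score": 0.91, "excerpt": "horse riding nearby"},
    {"id": "B2", "score": 0.75, "excerpt": "riding school"},
]}

request = fill_request(REQUEST_TEMPLATE, {"query": "horse riding"})
rows = [{"query": "horse riding", **r}
        for r in extract_rows(RESPONSE_TEMPLATE, response)]
```

Each element of `matches` becomes one output tuple, which mirrors the "input tuple → 0..n output tuples" contract of the procedure; the input `query` is copied into every tuple because the procedure also returns it as a column.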
7. Query Translation & Execution
Given a VDB defined using SQL/MED + API Bindings and an input query over the VDB
• Teiid splits the query into sub-queries based on translator capabilities and cost heuristics
• sub-queries are sent to translators & Teiid handles remaining operations (e.g., federated joins)
Example SQL query
SELECT s.score,
s.excerpt,
a."AccoCategoryId",
a."AccoDetail-en-Name",
a."AccoDetail-en-City"
FROM srv.vt_semsearch_match AS s
JOIN db.v_accommodationsopen AS a
ON s.id = a."Id"
WHERE s.query = 'horse riding'
ORDER BY s.score DESC
LIMIT 10
Execution plan
LimitNode (limit = 10)
  SortNode (s.score DESC)
    ProjectNode (s.score, ... a."AccoDetail-en-City")
      JoinNode (s.id = a."Id", merge join strategy)
        AccessNode (API)
          SELECT id, excerpt, score
          FROM vt_semsearch_match
          WHERE query = 'horse riding'
        AccessNode (RDB)
          SELECT "Id", "AccoDetail-en-Name",
                 "AccoDetail-en-City"
          FROM v_accommodationsopen
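The plan above can be mimicked outside Teiid to see what each node contributes. The sketch below is a simplified Python illustration, not Teiid's engine: the two row lists stand in for the results of the API and RDB access nodes, all data values are invented, and `merge_join` assumes the join keys are unique on each side.

```python
def merge_join(left, right, left_key, right_key):
    """Merge join of two row lists; assumes join keys are unique on each side."""
    left = sorted(left, key=lambda r: r[left_key])
    right = sorted(right, key=lambda r: r[right_key])
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, rk = left[i][left_key], right[j][right_key]
        if lk == rk:
            out.append({**left[i], **right[j]})
            i += 1
            j += 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return out


# AccessNode (API): rows returned for query = 'horse riding' (invented values)
api_rows = [
    {"id": "A1", "score": 0.91, "excerpt": "horse riding nearby"},
    {"id": "C3", "score": 0.80, "excerpt": "riding lessons"},
    {"id": "B2", "score": 0.75, "excerpt": "riding school"},
]
# AccessNode (RDB): rows of v_accommodationsopen (invented values)
rdb_rows = [
    {"Id": "A1", "AccoDetail-en-Name": "Farm Alpha", "AccoDetail-en-City": "Bolzano"},
    {"Id": "B2", "AccoDetail-en-Name": "Hotel Beta", "AccoDetail-en-City": "Merano"},
    {"Id": "D4", "AccoDetail-en-Name": "Inn Delta", "AccoDetail-en-City": "Brunico"},
]

joined = merge_join(api_rows, rdb_rows, "id", "Id")               # JoinNode
ranked = sorted(joined, key=lambda r: r["score"], reverse=True)   # SortNode
result = ranked[:10]                                              # LimitNode
```

Only the join, sort, and limit run in the federation layer here; the filter on `query` is assumed to have been answered by the API itself.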
8. Query Translation & Execution – Push-down of Projection, Filtering, Sorting, Slicing
Special input attributes map API capabilities related to standard relational operators
• filtering: return/process only objects matching some criteria (e.g., attribute = or ≥ constant)
• projection: include/exclude certain attributes in returned results
• sorting: sort results according to a certain attribute and direction (ascending/descending)
• slicing: return only a given page of all possible results
CREATE FOREIGN PROCEDURE api_station_data_from_to (
stype VARCHAR NOT NULL,
sname VARCHAR NOT NULL,
tname VARCHAR NOT NULL,
__min_inclusive__mvaliddate DATE NOT NULL, -- filter push down (conditions min <= mvaliddate <= max)
__max_inclusive__mvaliddate DATE NOT NULL,
__limit__ INTEGER -- slicing push down
) RETURNS TABLE ( ... )
OPTIONS ( ... );
Partial/complete push down of these operators whenever possible
• allows offloading computation to the API (e.g., sorting)
• allows reducing costs by manipulating & transferring less data
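As an illustration of the naming convention above, the following Python sketch decides which query constraints can be turned into inputs of an API procedure and which must be left to the federation engine. It is a simplified assumption of how such a decision could work, not the translator's actual logic; the function `plan_push_down`, the attribute `mvalue`, and all example values are hypothetical.

```python
def plan_push_down(params, equalities, ranges, limit=None):
    """Split query constraints into API inputs and residual operations."""
    inputs, residual = {}, []
    for attr, value in equalities.items():
        if attr in params:
            inputs[attr] = value  # plain input attribute
        else:
            residual.append(("eq", attr, value))
    for attr, (low, high) in ranges.items():
        low_param = f"__min_inclusive__{attr}"
        high_param = f"__max_inclusive__{attr}"
        if low_param in params and high_param in params:
            inputs[low_param], inputs[high_param] = low, high
        else:
            residual.append(("between", attr, low, high))
    if limit is not None:
        # slicing may only be pushed down if no filter is left to apply afterwards
        if "__limit__" in params and not residual:
            inputs["__limit__"] = limit
        else:
            residual.append(("limit", limit))
    return inputs, residual


# Parameter names of api_station_data_from_to, as declared above.
PARAMS = {"stype", "sname", "tname",
          "__min_inclusive__mvaliddate", "__max_inclusive__mvaliddate",
          "__limit__"}

full_inputs, full_residual = plan_push_down(
    PARAMS,
    equalities={"stype": "MeteoStation", "sname": "S1", "tname": "temperature"},
    ranges={"mvaliddate": ("2022-01-01", "2022-01-31")},
    limit=100,
)
part_inputs, part_residual = plan_push_down(
    PARAMS,
    equalities={"stype": "MeteoStation", "mvalue": 5},
    ranges={},
    limit=100,
)
```

In the second call the condition on `mvalue` has no matching input attribute, so it stays with the federation engine, and the limit is kept there too: pushing it to the API before the remaining filter is applied could drop valid results.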
9. Query Translation & Execution – Exploiting Bulk API Operations
Bulk API operations operate on multiple input tuples, such as lookup by set of IDs or bulk store
• their use enables better performance due to fewer API calls
• useful to speed up dependent joins (using IN operator) between RDBMS and API data
[Figure: dependent join R ⨝ S on R.A = S.A between an RDBMS table R and a virtual table S backed by a bulk API operation (A input attribute), in four steps]
1. SELECT A, … FROM R WHERE …
2. Extract values of join attribute A: a1, a2, …
3. SELECT A, … FROM S WHERE A IN (a1, a2, …) AND …
4. API bindings: bulk API calls with multiple input tuples for different values of A: a1, a2, …
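The dependent join described on this slide can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions, not Teiid's implementation: `bulk_lookup` stands in for a bulk API operation, `catalog` is invented data, and the batch size of 2 is chosen only to make the batching visible.

```python
def dependent_join(r_rows, bulk_lookup, key="A", batch_size=2):
    """Join RDBMS rows with API rows, fetching API data in batches of key values."""
    values = sorted({row[key] for row in r_rows})  # distinct join values
    s_rows, calls = [], 0
    for start in range(0, len(values), batch_size):
        batch = values[start:start + batch_size]
        s_rows.extend(bulk_lookup(batch))  # one bulk API call per batch (IN list)
        calls += 1
    by_key = {}
    for s in s_rows:
        by_key.setdefault(s[key], []).append(s)
    joined = [{**r, **s} for r in r_rows for s in by_key.get(r[key], [])]
    return joined, calls


# Invented data: the API only knows the values 1 and 3.
catalog = {1: "one", 3: "three"}


def bulk_lookup(batch):
    """Hypothetical bulk API operation: look up several values of A at once."""
    return [{"A": a, "label": catalog[a]} for a in batch if a in catalog]


r_rows = [{"A": 1, "name": "x"}, {"A": 2, "name": "y"},
          {"A": 3, "name": "z"}, {"A": 2, "name": "w"}]
joined, calls = dependent_join(r_rows, bulk_lookup)
```

With four input rows and three distinct values of A, two bulk calls are issued instead of one call per row.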
10. Data Materialization
Data materialization: required by API operations that cannot be invoked at query time
• operations too expensive to call at query time (e.g., align API and DB identifiers)
• operations instrumental to the use of external APIs (e.g., text indexing in a search engine)
Solution #1: materialized views in Teiid (or other data federation system used)
Solution #2: dedicated materialization engine for flexibly executing arbitrary materialization rules:
• identifier – for documentation & diagnostics
• target – the system-managed computed table (possibly virtual) where data is stored
• source – arbitrary SQL query (over any tables) that produces the data to store
rules:
  - id: index_accommodation_texts
    target: vt_semsearch_index
    source: |-
      SELECT "Id" AS id,
             "AccoDetail-en-Longdesc" AS text
      FROM v_accommodationsopen
      WHERE "AccoDetail-en-Longdesc" IS NOT NULL
  - ... other rules ...
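To show what executing one such rule amounts to, here is a small Python sketch that runs the rule's source query and hands the result to an upsert on the target. It is an assumption-laden stand-in for the materialization engine: an in-memory SQLite database replaces the VDB, `fake_upsert` replaces the API-backed upsert of the virtual table, and the sample rows are invented.

```python
import sqlite3

# The rule from the slide, expressed as a Python dict.
RULE = {
    "id": "index_accommodation_texts",
    "target": "vt_semsearch_index",
    "source": 'SELECT "Id" AS id, "AccoDetail-en-Longdesc" AS text '
              'FROM v_accommodationsopen '
              'WHERE "AccoDetail-en-Longdesc" IS NOT NULL',
}


def run_rule(conn, rule, upsert):
    """Evaluate the rule's source query and store its rows into the target."""
    cursor = conn.execute(rule["source"])
    columns = [col[0] for col in cursor.description]
    rows = [dict(zip(columns, values)) for values in cursor]
    upsert(rule["target"], rows)
    return len(rows)


# Stand-in for the VDB: an in-memory SQLite table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE v_accommodationsopen '
             '("Id" TEXT PRIMARY KEY, "AccoDetail-en-Longdesc" TEXT)')
conn.executemany('INSERT INTO v_accommodationsopen VALUES (?, ?)',
                 [("A1", "Farm with horse riding"), ("B2", None), ("C3", "City hotel")])

index = {}


def fake_upsert(target, rows):
    """Hypothetical upsert; in the talk's setup this would trigger API calls."""
    for row in rows:
        index.setdefault(target, {})[row["id"]] = row["text"]


count = run_rule(conn, RULE, fake_upsert)
```

The row with a NULL description is filtered out by the source query, so only two texts reach the index.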
11. Data Materialization (cont’d)
Rules (their SQL source queries) are analyzed to derive a rule dependency graph, which is mapped to an execution plan using fixpoint rule evaluation for strongly connected components
[Figure with three panels over rules R1–R5: "Rule / Table Dependencies", "Rule Dependencies", "Execution Plan"]
Execution plan:
sequence (
  parallel (
    R1,
    sequence (
      R2,
      fixpoint (
        parallel (
          R3,
          R4
        )
      )
    )
  ),
  R5
)
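The step from rule dependencies to an execution plan can be sketched with a standard strongly-connected-components computation. The Python code below uses Tarjan's algorithm and emits a flat list of stages, marking cyclic groups as fixpoint stages; it does not derive the nested sequence/parallel structure of the plan shown on the slide. The dependency edges in `DEPS` are an assumption chosen to be compatible with that plan, since the dependency figure itself is not reproduced here.

```python
def plan(deps):
    """Order rules so dependencies run first; cyclic groups become fixpoint stages."""
    index, low, on_stack, stack, stages = {}, {}, set(), [], []

    def visit(rule):
        index[rule] = low[rule] = len(index)
        stack.append(rule)
        on_stack.add(rule)
        for dep in deps[rule]:
            if dep not in index:
                visit(dep)
                low[rule] = min(low[rule], low[dep])
            elif dep in on_stack:
                low[rule] = min(low[rule], index[dep])
        if low[rule] == index[rule]:  # rule is the root of a component
            component = []
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.append(member)
                if member == rule:
                    break
            cyclic = len(component) > 1 or rule in deps[rule]
            stages.append(("fixpoint", sorted(component)) if cyclic
                          else ("run", component[0]))

    for rule in sorted(deps):
        if rule not in index:
            visit(rule)
    return stages


# Assumed dependencies: rule -> rules whose output it reads.
DEPS = {
    "R1": [],
    "R2": [],
    "R3": ["R2", "R4"],
    "R4": ["R3"],
    "R5": ["R1", "R3", "R4"],
}
stages = plan(DEPS)
```

Because edges point from a rule to its dependencies, Tarjan's algorithm emits each component only after the components it depends on, so the emission order is already a valid execution order. R3 and R4 depend on each other and therefore end up in a single fixpoint stage, to be re-evaluated until no new data is produced.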
12. VKG over APIs – Ontology-Based Data Access (OBDA) & Ontop
OBDA builds a VKG on an RDB source
• an ontology defines the VKG classes and properties (TBox)
• mappings define how to populate each class/property with RDB data (ABox)
• query rewriting maps VKG queries (SPARQL) into native queries (SQL) over the source
• Ontop open-source system
Idea: build a VDB over APIs, then apply OBDA to convert it into a VKG
• Ontop + Teiid/service translator
[Figure: query answering by query rewriting. An ontological query q is rewritten using the ontology and unfolded using the mappings into an SQL query; the SQL query is evaluated over the data sources, and the relational answer is translated into an ontological answer. Steps: Rewriting, Unfolding, Evaluation, Result Translation. Embedded slide by Diego Calvanese, Francesco Corcoglioniti, Guohui Xiao (unibz), "VKGs for Data Access and Integration", Huawei, 03/08/202]
13. VKG over APIs – Ontology & Mappings Example
Ontology
schema:Accommodation a owl:Class ;
rdfs:subClassOf schema:Place ;
rdfs:label "Accommodation"@en ;
...
schema:name a owl:DatatypeProperty ;
...
hive:Match a owl:Class ...
The current ontology formalism (OWL 2 QL) is reused as is, but now also models data from APIs
Mappings
mappingId Semantic Search
target data:match/accommodation/{id}/{query}
a hive:Match;
hive:query {query}^^xsd:string;
hive:resource data:accommodation/{id};
hive:excerpt {excerpt}@en;
hive:score {score}^^xsd:decimal.
source SELECT *
FROM hiveodh.srv.vt_semsearch_match
The current VKG mapping formalism is reused as is, but data may now come from API virtual tables
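To illustrate what the mapping above expresses, the following Python sketch instantiates its target template for one row of `vt_semsearch_match`. It only shows the idea of a mapping as a row-to-triples template: Ontop does not materialize triples this way but uses mappings for query rewriting. The function `apply_mapping` and the sample row are hypothetical, percent-encoding of the IRI template values is assumed, and literal escaping is omitted for brevity.

```python
from urllib.parse import quote


def apply_mapping(row):
    """Instantiate the 'Semantic Search' mapping target for one source row."""
    subject = "data:match/accommodation/{}/{}".format(
        quote(row["id"], safe=""), quote(row["query"], safe=""))
    return [
        (subject, "rdf:type", "hive:Match"),
        (subject, "hive:query", '"{}"^^xsd:string'.format(row["query"])),
        (subject, "hive:resource",
         "data:accommodation/{}".format(quote(row["id"], safe=""))),
        (subject, "hive:excerpt", '"{}"@en'.format(row["excerpt"])),
        (subject, "hive:score", '"{}"^^xsd:decimal'.format(row["score"])),
    ]


# Invented row, shaped like the output of the vt_semsearch_match virtual table.
row = {"query": "horse riding", "id": "A1", "score": 0.91,
       "excerpt": "horse riding nearby"}
triples = apply_mapping(row)
```

Every row returned by the API thus corresponds to one `hive:Match` individual linked to the matched accommodation, which is what lets a SPARQL query join search results with the rest of the knowledge graph.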
14. VKG over APIs – Query Rewriting & Evaluation Example
User-supplied SPARQL query
SELECT ?h ?posLabel ?rating ?pos {
[] a hive:Match ;
hive:query "horse riding"^^xsd:string ;
hive:resource ?h ;
hive:excerpt ?excerpt ;
hive:score ?score .
?h a schema:LodgingBusiness ;
geo:defaultGeometry/geo:asWKT ?pos ;
schema:name ?name ;
schema:description ?description ;
schema:starRating/schema:ratingValue ?rating.
FILTER (?rating >= 3 && lang(?name) = 'en' &&
lang(?description) = 'en')
BIND (CONCAT(?name, " <br><br>...", ?excerpt,
"...<br><br>", ?description) AS ?posLabel)
}
ORDER BY DESC(?score) LIMIT 10
SQL query rewritten by Ontop
SELECT
v1.id,
v1.excerpt, -- fields used
v2."AccoDetail-en-Name", -- for deriving
v2."AccoDetail-en-Longdesc", -- ?posLabel
... complex expression computing rating ...,
ST_ASTEXT(v2."Geometry")
FROM
hiveodh.srv.vt_semsearch_match v1,
hiveodh.db.v_accommodationsopen v2
WHERE
v1."id" = v2."Id" AND
CAST(v1."query" AS TEXT) = 'horse riding' AND
... complex condition on rating >= 3 ... AND
... nonnull conditions for output columns ...
ORDER BY CAST(v1."score" AS DECIMAL) DESC
LIMIT 10
SQL query evaluated on the VDB by Teiid
15. VKG over APIs – ODH with Semantic Search Demo
Data sources: DB with ODH tourism data + semantic search API to index & query accommodation texts
System: Ontop embedding Teiid + materialization engine
Demo: https://hive.inf.unibz.it/odh/vkg/
16. Overall Framework & Ongoing Work
[Figure: VDB-based applications query the Virtual DB via SQL; VKG-based applications query the Virtual Knowledge Graph via SPARQL; the VKG sends SQL to the VDB, which accesses RDB sources via SQL and API sources via calls; the figure carries markers 1, 2, 3]
Components:
• Virtual DB (VDB): Teiid + service translator
• Virtual Knowledge Graph (VKG): Ontop
• API Bindings: define how to query/update a virtual table via API calls, if possible → limited access patterns
• Materialization Rules: pre-compute results of expensive API calls → VDB/VKG no longer fully “virtual”
• VKG Mappings: including virtual tables, used for query rewriting
• VKG Ontology: formalizes the classes/properties (the “schema”) of the VKG, enabling reasoning
Ongoing work:
1. query rewriting tuned to VDB + APIs
2. service translator improvements
3. change data capture tools (e.g. Debezium) for incremental materialization
4. application to analysis of static + dynamic data in the domain of climate risk management (OntoCRM project)