Jens Lehmann's overview of the use of semantics in the Big Data Europe Integrator Platform, including the Semantic Data Lake (Ontario) and the SANSA Analytics Engine.
2. SANSA: Motivation
◎ Abundant machine-readable structured information is
available (e.g. in RDF)
o Across BDE Societal Challenges, e.g. for Life Science Data
o General: DBpedia, Google knowledge graph
o Social graphs: Facebook, Twitter
◎ Need for scalable querying, inference and machine
learning
o Link prediction
o Knowledge base completion
o Predictive analytics
o Forward chaining inference
4. SANSA Stack
◎ SANSA includes several libraries:
o Read / Write RDF / OWL library
o Querying library
o Inference library
o Machine Learning (ML) core library
http://sansa-stack.net/
◎ Framework for distributed RDF dataset processing
aiming at high scalability and fault tolerance
6. SANSA: Read Write Layer
◎ Ingest RDF and OWL data in different formats using
Jena / OWL API style interfaces
◎ Goal: Represent/distribute data in multiple formats
(e.g. RDD, Data Frames, GraphX, Tensors)
◎ Goal: Allow transformation among these formats
◎ Compute dataset statistics and apply functions to URIs,
literals, subjects, objects → Distributed LODStats
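The statistics idea above can be illustrated with a minimal single-machine sketch. It is plain Python, not the SANSA API: the sample triples, the function names, and the convention that a literal is written with a leading double quote are all assumptions made for illustration. A distributed version would compute the same counters per partition and merge them.

```python
from collections import Counter

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:name", '"Bob"'),
]


def is_literal(term):
    # Assumption of this sketch: literals are written with a leading quote.
    return term.startswith('"')


def dataset_statistics(triples):
    """Compute a few LODStats-style criteria in a single pass over the triples."""
    property_usage = Counter()
    class_usage = Counter()
    subjects = set()
    literal_count = 0
    for s, p, o in triples:
        property_usage[p] += 1
        subjects.add(s)
        if p == "rdf:type":
            class_usage[o] += 1
        if is_literal(o):
            literal_count += 1
    return {
        "triples": sum(property_usage.values()),
        "distinct_subjects": len(subjects),
        "literals": literal_count,
        "property_usage": dict(property_usage),
        "class_usage": dict(class_usage),
    }


stats = dataset_statistics(TRIPLES)
print(stats)
```

Counters and sets are used because they can be merged associatively, which is what makes this kind of statistic easy to distribute.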
10. SANSA: Inference Layer
◎ W3C Standards for Modelling: RDFS and OWL
◎ Parallel in-memory inference via rule-based forward
chaining
◎ Beyond state of the art: dynamically build a rule
dependency graph for a rule set
◎ → Adjustable performance levels
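Rule-based forward chaining can be sketched as follows. This is a minimal in-memory illustration in plain Python, not SANSA's implementation: it applies only two RDFS rules (rdfs11, transitivity of rdfs:subClassOf, and rdfs9, type inheritance along rdfs:subClassOf) and repeats them until no new triple is derived. The example facts and the function name are assumptions.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def forward_chain(triples):
    """Apply rdfs9 and rdfs11 repeatedly until a fixpoint is reached."""
    closure = set(triples)
    while True:
        sub_class = {(s, o) for s, p, o in closure if p == SUBCLASS}
        types = {(s, o) for s, p, o in closure if p == TYPE}
        derived = set()
        # rdfs11: (A subClassOf B), (B subClassOf C) => (A subClassOf C)
        for a, b in sub_class:
            for b2, c in sub_class:
                if b == b2:
                    derived.add((a, SUBCLASS, c))
        # rdfs9: (x type A), (A subClassOf B) => (x type B)
        for x, a in types:
            for a2, b in sub_class:
                if a == a2:
                    derived.add((x, TYPE, b))
        new_triples = derived - closure
        if not new_triples:
            return closure
        closure |= new_triples


# Hypothetical example facts.
facts = {
    ("ex:Dog", SUBCLASS, "ex:Mammal"),
    ("ex:Mammal", SUBCLASS, "ex:Animal"),
    ("ex:rex", TYPE, "ex:Dog"),
}
closure = forward_chain(facts)
for triple in sorted(closure - facts):
    print(triple)
```

Here rdfs9 consumes the output of rdfs11, which is the kind of relationship a rule dependency graph would record in order to decide which rules to run and in which order.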
12. SANSA: ML Layer
◎ Distributed Machine Learning (ML) algorithms that work
on RDF data and make use of its structure / semantics
◎ Work in Progress:
o Tensor Factorization for e.g. KB completion (testing stage)
o Graph Clustering (testing stage)
o Association rule mining (evaluation stage)
o Semantic Decision trees (idea stage)
o Inference in Knowledge Graph Embeddings (idea stage)
13. SANSA Status and Releases
◎ A generic stack for (big) Linked Data
o Built on top of state-of-the-art distributed frameworks (Spark, Flink)
o Easy to use: just add the dependencies to your existing Spark or Flink
project
◎ Out-of-the-box framework for (1) querying, (2) inference
and (3) machine learning over RDF datasets
◎ 0.1 Release in December
◎ Subsequent releases every 6 months
14. Semantic Data Lake (Ontario)
◎ Data Lake or Swamp?
o Repository of data in its original formats
o Structured, semi-structured, unstructured
o Without unified schema
◎ Semantic Data Lake
o Add a Semantic Layer on top of the source datasets
❖ The data is semantically lifted using ontology
terms
❖ Provide a uniform view over nonuniform data
15. Semantic Data Lake (Ontario)
◎ A SPARQL query is decomposed into sub-queries
o Keeping links between sub-queries to collect data
returned later from each sub-query
◎ Data Lake is checked for candidate data sources that
can answer each sub-query (consulting metadata)
◎ Each SPARQL sub-query is translated into a query in the
query language of the selected candidate data source
o Examples: SPARQL to SQL, or SPARQL to CQL
◎ Sub-results are joined together to form the end result
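The decomposition and join steps above can be sketched in a few lines. This is a simplified illustration in plain Python, not Ontario's code: the metadata table, the source names, the property ex:population, and the result rows are all hypothetical, and the translation of each sub-query into SQL or CQL is left out. The shared variables play the role of the links kept between sub-queries.

```python
from collections import defaultdict

# Hypothetical metadata: property -> (data source, type).
METADATA = {
    "gho:Country": ("observations_db", "SQL"),
    "gho:Disease": ("observations_db", "SQL"),
    "ex:population": ("countries_store", "MongoDB"),
}


def decompose(patterns):
    """Group triple patterns by the data source that can answer their property."""
    sub_queries = defaultdict(list)
    for s, p, o in patterns:
        source = METADATA[p]  # consult the metadata for a candidate source
        sub_queries[source].append((s, p, o))
    return dict(sub_queries)


def join(left, right):
    """Join two lists of variable bindings on the variables they share."""
    joined = []
    for left_row in left:
        for right_row in right:
            shared = left_row.keys() & right_row.keys()
            if all(left_row[v] == right_row[v] for v in shared):
                joined.append({**left_row, **right_row})
    return joined


patterns = [
    ("?item", "gho:Country", "?country"),
    ("?item", "gho:Disease", "?disease"),
    ("?country", "ex:population", "?population"),
]
sub_queries = decompose(patterns)

# Hypothetical sub-results, as if each source had already answered its sub-query.
observations = [
    {"?item": "obs1", "?country": "country_a", "?disease": "disease_x"},
    {"?item": "obs2", "?country": "country_b", "?disease": "disease_y"},
]
countries = [{"?country": "country_a", "?population": 100}]

results = join(observations, countries)
print(results)
```

Only the first observation survives the join, because only its ?country binding also appears in the second sub-result.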
16. Semantic Data Lake (Ontario)
[Figure: query decomposition and source selection. The user's SPARQL query is decomposed into sub-queries, e.g. "?item gho:Country ?country . ?item gho:Disease ?disease . ...". Metadata of the form "property -> data source (type)" is used to find the relevant data sources (Database, XML File, MongoDB), and the sub-queries are converted into the sources' query languages (SQL, XPath, JSON Path), e.g. "SELECT country, disease, ... FROM Observations".]
17. Semantic Data Lake (Ontario)
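[Figure: execution and reconciliation. The execution plan runs the sub-queries against the data sources (Database, XML File, MongoDB); results reconciliation uses the links between sub-queries to produce the final results.]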
18. Big Data Europe Integrator Platform Launch
Wednesday 3 May @ 15:00 CEST
Please type your questions at any time. Q&A will follow the presentations
20. SANSA: Motivation
◎ Over the last few years, the size of the Semantic Web has
increased and several large-scale datasets have been published.
Source: LOD-Cloud (http://lod-cloud.net/)
◎ Nowadays the Hadoop ecosystem has become a standard for
Big Data applications.
◎ We use this infrastructure for the Semantic Web as well.
23. SANSA Planning
◎ Current state:
o SANSA 0.1 released in December
o Can read RDF/OWL files, compute statistics, simple
queries, lightweight inference, graph clustering, rule
mining
◎ Subsequent releases every 6 months
o SANSA 0.2 release planned in June 2017.
24. Conclusions and Next steps
◎ A generic stack for (big) Linked Data
o Built on top of state-of-the-art distributed frameworks (Spark, Flink)
◎ Out-of-the-box framework for scalable and distributed
semantic data analysis combining semantic web and
distributed machine learning for (1) querying, (2)
inference and (3) analytics of RDF datasets.
◎ Next steps
o Refinement of data structures (RDF/OWL Layer)
o Add support for SPARQL 1.1 and other backend strategies (Query
Layer)
o Define ML pipelines for structured ML (ML Layer)
25. BDE & General Integration
◎ SANSA = Scala / Maven Repositories based on Spark /
Flink
◎ Easy to include both in BDE platform and any Spark /
Flink environment
Editor's Notes
Traditional machine learning methods operate on simple feature-vector input and lack expressive, understandable outcomes.
Methods such as statistical relational learning and inductive logic programming are expressive but lack scalability.
Goal:
Build a semantic analytics stack which allows you to perform distributed:
Querying
Inference
Analytics on RDF datasets
Existing Open Source libraries either:
Lack Community
Lack Documentation and Examples
Lack Scalability
Or are research-oriented
Working:
A library built on top of Spark and Flink
Consists of several APIs:
Read / Write: RDF / OWL API for RDF/OWL operations
Querying API: supports a query language on top of the distributed RDF library
Inference API: implementation of a rule-based reasoner using the querying API
Machine Learning (ML) core library