1. BigDataEurope - Supporting the
Variety Dimension of Big Data
Mohamed Nadjib MAMI - Fraunhofer IAIS - ICWE17 - 06.06.2017
2. Big Data Europe - the Project
◎ Funded by the EU Horizon 2020 programme
◎ Coordination & Support Action (CSA) project
o Show the societal value of Big Data in 7 domains
o Lower the barrier to using Big Data technologies
=> BigDataEurope Platform
4. Big Data Europe - The Platform
◎ Integrator of Big Data technologies
o Easy to use/get started (plug-and-play)
o Flexible, Customisable
◎ Bundles only open-source solutions for:
o Data Storage
o Message Passing
o Data Processing
o Data Searching & Publishing
◎ Publicly released in May 2017
6. BDE Platform - Architecture
Layers shown in the architecture diagram:
◎ Support Layer: Init Daemon, GUIs, Base Setup
◎ App Layer: Traffic Forecast, Satellite Image Analysis, ...
◎ Platform Layer: Spark, Flink, Semantic Layer (Ontario, SANSA, Semagrow), Kafka, Real-time Stream Monitoring, ...
◎ Resource Management Layer (Swarm)
◎ Hardware Layer: Premises, Cloud (AWS, GCE, MS Azure, …)
◎ Data Layer: Hadoop, NoSQL Store, Cassandra, Elasticsearch, RDF Store, ...
o Semantic Data Lake (Unified View)
7. BDE Platform - Hardware & Virtualization
◎ Docker used for packaging and deploying applications
◎ Based on containers:
o A lightweight environment to make a piece of
software run in isolation
❖ Shares the host operating system kernel (unlike
VMs)
❖ Reduces conflicts, e.g., between software versions
◎ Docker Compose: creates multi-container applications
8. BDE Platform - Resource Management
◎ Swarm (mode) used for managing, scheduling and
orchestrating Docker containers in multi-node clusters
◎ It provides:
o Scalability and fault tolerance
o Container interlinking
o Log-based monitoring
◎ Separates hardware management from software management
◎ Based on services
o A service is the Swarm execution unit that runs a Docker image
9. BDE Platform - Support Layer
◎ Init Daemon: orchestrates the initialization process of
the components (containers of Docker Compose):
o Components report their initialization progress
o It validates whether a specific component can start
o It specifies the dependencies between services
o It indicates where human interaction is required
◎ Examples:
o Wait for data to be loaded into HDFS before starting a Spark job
o Wait for the Spark Master to start successfully before starting a Worker
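
The gating behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the Init Daemon's actual code or API: the class `InitDaemonSketch`, the function `run_pipeline` and the component names are hypothetical, and the status values are borrowed from the Pipeline Monitor slide.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each component maps to the components it waits for.
DEPENDENCIES = {
    "hdfs_load": set(),
    "spark_master": set(),
    "spark_worker": {"spark_master"},
    "spark_job": {"hdfs_load", "spark_worker"},
}


class InitDaemonSketch:
    """Toy stand-in for an init daemon: tracks status and gates start-up."""

    def __init__(self, dependencies):
        self.dependencies = dependencies
        self.status = {name: "not started" for name in dependencies}

    def can_start(self, component):
        # A component may start only once all of its dependencies have finished.
        return all(self.status[dep] == "finished"
                   for dep in self.dependencies[component])

    def report(self, component, status):
        # Components report their own initialization progress.
        self.status[component] = status


def run_pipeline(dependencies):
    """Start every component in an order that respects the dependencies."""
    daemon = InitDaemonSketch(dependencies)
    started = []
    for component in TopologicalSorter(dependencies).static_order():
        if not daemon.can_start(component):
            raise RuntimeError(f"{component} started before its dependencies")
        daemon.report(component, "running")
        started.append(component)
        daemon.report(component, "finished")
    return started


order = run_pipeline(DEPENDENCIES)
print(order)
```

In this sketch the start order is derived from the dependency graph, so the Spark Worker can never be started before the Spark Master, and the Spark job waits for both the data load and the Worker.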
10. BDE Platform - User Interfaces
Pipeline Builder: creates a step-by-step dependency
pipeline (fed to the init daemon)
[Screenshot: pipeline with Component 1, Component 2 and Component 3]
11. BDE Platform - User Interfaces
Pipeline Monitor: displays the status (not started, running or finished) of
components in a running pipeline (retrieved from the init daemon)
[Screenshot: Component 1 - Finished, Component 2 - Finished, Component 3 - In progress]
12. BDE Platform - User Interfaces
Swarm UI: clones a Git repository containing a
pipeline and deploys, controls and monitors it on Swarm
13. BDE Platform - User Interfaces
Integrator UI: displays the dashboard of each running
component in a unified interface
14. BDE Platform - Semantic Layer > Ontario
◎ Data Lake or Swamp?
o Repository of data in its original formats
o Structured, semi-structured, unstructured
o Without unified schema
◎ Semantic Data Lake (Ontario)
o Add a Semantic Layer on top of the source datasets
❖ The data is semantically lifted using ontology
terms
❖ Provides a uniform view over nonuniform data
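
A minimal Python sketch of the idea of semantic lifting follows. It is not Ontario's implementation: the source names, column names and the `ex:` terms are hypothetical and serve only to show how per-source mappings to shared ontology terms yield one uniform view.

```python
# Hypothetical mappings: each source keeps its own schema, and its columns
# are mapped to shared (made-up) ontology terms.
MAPPINGS = {
    "csv_patients": {
        "id_column": "pid",
        "columns": {"fullname": "ex:name", "yob": "ex:birthYear"},
    },
    "json_people": {
        "id_column": "_id",
        "columns": {"name": "ex:name", "born": "ex:birthYear"},
    },
}


def lift(source, record):
    """Turn one source record into (subject, predicate, object) triples."""
    mapping = MAPPINGS[source]
    subject = f"ex:{source}/{record[mapping['id_column']]}"
    return [
        (subject, term, record[column])
        for column, term in mapping["columns"].items()
        if column in record
    ]


def values_of(triples, term):
    """Uniform access: all values of one ontology term, whatever the source."""
    return sorted(obj for _, pred, obj in triples if pred == term)


triples = (
    lift("csv_patients", {"pid": 1, "fullname": "Ada", "yob": 1815})
    + lift("json_people", {"_id": "a7", "name": "Alan", "born": 1912})
)
names = values_of(triples, "ex:name")
print(names)
```

The two sources keep their original formats and column names; only the mapping layer knows that `fullname` and `name` denote the same term.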
15. BDE Platform - Semantic Layer > Ontario
# Prefix declarations omitted, as on the slide.
SELECT (COUNT(DISTINCT ?publication) AS ?no_of_publications)
       (COUNT(?deaths) AS ?no_of_deaths)
WHERE {
  ?item a qb:Observation .
  ?item gho:Country ?country .
  ?item gho:Disease ?disease .
  ?item att:unitMeasure gho:Measure .
  ?item eg:incidence ?deaths .
  ?country rdfs:label "India" .
  ?disease rdfs:label "Tuberculosis" .
  ?trial a ct:trials .
  ?trial ct:condition ?condition .
  ?trial ct:location ?location .
  ?trial ct:reference ?publication .
  ?condition owl:sameAs ?disease .
  ?location redd:locatedIn ?country .
  ?publication ct:citation ?citation .
}
Query “number of distinct publications and number of
distinct deaths due to the disease Tuberculosis in India”
18. BDE Platform - Semantic Layer > SANSA
SANSA: a framework for distributed RDF
data processing
◎ Read/write Layer: reads and writes
native RDF/OWL data in distributed
storage and processing frameworks, e.g.,
Hadoop, Spark (RDD, DataFrames, GraphX),
tensors, following different representations
and partitioning schemes, e.g., graphs, tables
◎ Querying Layer: queries distributed
RDF using SPARQL (SPARQL-to-SQL
approaches, Virtual Views, Intelligent
Indexing, ...)
http://sansa-stack.net
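
To illustrate what a SPARQL-to-SQL approach means in general, here is a small Python sketch using the standard `sqlite3` module. It is not SANSA's code: the single `triples(s, p, o)` table, the function `bgp_to_sql` and the `ex:` resources are assumptions made for this example. Each triple pattern becomes one alias of the table, shared variables become join conditions, and constants become filters.

```python
import sqlite3


def bgp_to_sql(patterns, select_vars):
    """Translate a basic graph pattern into a SQL self-join over triples(s, p, o)."""
    where, params, first_seen = [], [], {}
    for i, pattern in enumerate(patterns):
        for column, term in zip("spo", pattern):
            ref = f"t{i}.{column}"
            if term.startswith("?"):
                if term in first_seen:
                    # Shared variable: join with its first occurrence.
                    where.append(f"{ref} = {first_seen[term]}")
                else:
                    first_seen[term] = ref
            else:
                # Constant term: becomes a parameterised filter.
                where.append(f"{ref} = ?")
                params.append(term)
    columns = ", ".join(f"{first_seen[v]} AS {v[1:]}" for v in select_vars)
    tables = ", ".join(f"triples AS t{i}" for i in range(len(patterns)))
    sql = f"SELECT DISTINCT {columns} FROM {tables}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql, params


# Toy data set; the ex: resources are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("ex:obs1", "gho:Country", "ex:india"),
    ("ex:obs1", "gho:Disease", "ex:tb"),
    ("ex:obs2", "gho:Country", "ex:india"),
    ("ex:obs2", "gho:Disease", "ex:malaria"),
    ("ex:india", "rdfs:label", "India"),
    ("ex:tb", "rdfs:label", "Tuberculosis"),
])

patterns = [
    ("?item", "gho:Country", "?country"),
    ("?item", "gho:Disease", "?disease"),
    ("?country", "rdfs:label", "India"),
    ("?disease", "rdfs:label", "Tuberculosis"),
]
sql, params = bgp_to_sql(patterns, ["?item"])
rows = conn.execute(sql, params).fetchall()
print(rows)
```

A distributed engine would run the resulting joins over partitioned tables rather than an in-memory SQLite database, but the translation step has the same shape.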
19. BDE Platform - Semantic Layer > SANSA
◎ Inference Layer: Derive new facts from
existing ones, detect inconsistencies,
extract new rules to help in reasoning
◎ Machine Learning Layer: Perform ML
or analytics to gain insights for relevant
trends, predictions or detection of
anomalies from RDF data
o Tensor Factorization for e.g. KB
completion (testing stage)
o Graph Clustering (testing stage)
o Association rule mining (evaluation stage)
o Semantic Decision trees (idea stage)
o Inference in Knowledge Graph
Embeddings (idea stage)
20. BDE Platform - Semantic Layer > Semagrow
Semagrow: a SPARQL query processing system that federates
multiple remote endpoints
◎ Original Semagrow
o Optimizes queries transparently
o Executes sub-queries in the remote endpoints
o Dynamically integrates results expressed in heterogeneous
data models
o Joins the partial results into the final query answer
◎ Next-gen Semagrow
o Supports different query languages
o Query planner and execution engine adapted accordingly,
e.g., translating SPARQL to CQL for Cassandra
databases
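
The federation steps listed above (run sub-queries at the endpoints, then join the partial results) can be sketched as follows. This is a toy illustration, not Semagrow's planner or API: the two endpoint functions, their toy bindings and the helper names are hypothetical, and every endpoint is assumed to return rows that all bind the same variables.

```python
def join_bindings(left, right):
    """Hash-join two lists of variable bindings on their shared variables."""
    if not left or not right:
        return []
    # Assumes all rows of one result bind the same variables.
    shared = sorted(set(left[0]) & set(right[0]))
    index = {}
    for row in right:
        index.setdefault(tuple(row[v] for v in shared), []).append(row)
    joined = []
    for row in left:
        for match in index.get(tuple(row[v] for v in shared), []):
            joined.append({**row, **match})
    return joined


# Hypothetical remote endpoints, each answering one sub-query with toy data.
def observations_endpoint():
    return [
        {"disease": "ex:tb", "deaths": 10},
        {"disease": "ex:malaria", "deaths": 20},
    ]


def trials_endpoint():
    return [
        {"disease": "ex:tb", "publication": "ex:pub1"},
        {"disease": "ex:tb", "publication": "ex:pub2"},
    ]


def federated_query(endpoints):
    """Run each sub-query, then join the partial results one by one."""
    result = endpoints[0]()
    for endpoint in endpoints[1:]:
        result = join_bindings(result, endpoint())
    return result


answer = federated_query([observations_endpoint, trials_endpoint])
print(answer)
```

A real federation engine would also choose which endpoint receives which sub-query and in what order to join; this sketch fixes both by hand.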
21. BDE Showcases (pilots)
SC1 - Open PHACTS discovery platform relating to biological/medical questions
SC2 - Discovery and Linking of Viticulture-relevant information
SC3 - System monitoring in energy production units
SC4 - Short-term traffic flow forecasting
SC5 - Supporting data-intensive climate research
SC6 - Citizens & Researchers Budget on Municipal Level
SC7 - Ingestion of remote sensing images and social sensing data to detect and verify
changes on the Earth surface for security applications
◎ 7 Societal Challenges > 7 pilot implementations
22. Showcase SC1: Health, demographic
change and wellbeing
◎ SC1 implements the Open PHACTS Discovery Platform
o Integrates and links data from multiple sources:
ChEBI, ChEMBL, the Gene Ontology and UniProt
(Chemistry, Biological, Medical, etc.)
o Explores the relationships between data
(compounds, targets, pathways, diseases and
tissues)
o Data is accessed through RESTful API requests
❖ Translated to SPARQL queries
◎ Technologies used:
o 4store, Memcached, MySQL, Puelia, Swagger
23. Showcase SC7: Secure Societies
◎ Detect changes in land cover in satellite images (e.g.,
monitoring critical infrastructures)
◎ Display geo-located events in news sites and social
media (e.g., news articles, social networks)
◎ Three workflows:
o Change detection workflow
o Event detection workflow
o Activation workflow
◎ Technologies used: Apache Spark, Cassandra,
Sextant, Semagrow, Strabon, GeoTriples
24. Showcase 2 (SC7): Secure Societies
General Architecture of the SC7 Pilot
25. Showcase 2 (SC7): Secure Societies
Change detection workflow
[Figure labels: area and time interval of interest; satellite images; compare images]
26. Showcase 2 (SC7): Secure Societies
Event detection workflow
[Figure labels: associate names with coordinates; cluster news into events (associate geo-location)]
27. Showcase 2 (SC7): Secure Societies
Activation detection workflow
[Figure labels: areas with changes; summary of events; spatiotemporal RDF store]
28. Showcase 2 (SC7): Secure Societies
Example: refugee camps located in Zaatari, Jordan
[Figure labels: news; tweets; selected area; detected changes]
29. Thanks & Questions?
For more info...
◎ Project-related: Simon Scerri (scerri@cs.uni-bonn.de)
◎ Ontario: Mohamed Nadjib Mami (mami@cs.uni-bonn.de)
◎ SANSA: Jens Lehmann (jens.lehmann@cs.uni-bonn.de)
◎ Semagrow: Stasinos Konstantopoulos (konstant@iit.demokritos.gr)
◎ Pilots (showcases):
o SC1: Ronald Siebes (rm.siebes@few.vu.nl)
o SC7: George Papadakis (gpapadis@di.uoa.gr)
o All: Ronald Siebes (rm.siebes@few.vu.nl)
◎ GitHub repos: https://github.com/big-data-europe/README
◎ Website: https://big-data-europe.eu
30. BDE Platform vs. Hadoop Distributions
SFR = Single failure recovery
MFR = Multiple failure recovery
SF = Self healing