Introduction to Big Data and Semantic Web technologies for Big Data. It was presented at the intro course "Big Data in Agriculture": http://wiki.agroknow.gr/agroknow/index.php/Athens_Green_Hackathon_2013
5. Big Data Is…
Data whose scale, diversity, and complexity
require new architecture, techniques, algorithms,
and analytics to manage it and extract value and
hidden knowledge from it
6. Big Data Sources
• Biomedical Information
• Sensor Data
• Logs
• E-mails
• Satellite images
• Audio and Video Streams
• Social Networks
7. Big Data Challenges – “The Three Vs”
Volume
Variety
Velocity
…or is it 4…?
Veracity
…or is it 6…??
Visualization
Value
8. Big Data demand…
• Storage
– Impractical or impossible to use centralized storage
• Distribution
• Federation
– Indexing is a problem in itself
• Computational power
– For discovering
– For searching / retrieving
– For joining
• Human effort and expertise
– Querying can become complex
– Are you sure you exploit all this information?
10. The Syntactic and the Semantic Web
• The World Wide Web represents information
using natural language, graphics, multimedia...
– Humans can process and combine this
information easily
– However, machines are ignorant!
• The Semantic Web is a Web with a meaning
– A web of data that is understandable by the
machines
11. Semantic Web Technologies
• Common formats for integration and combination of data
drawn from diverse sources, whereas the original Web
mainly concentrated on the interchange of documents.
• For defining
– RDFS http://www.w3.org/TR/rdf-schema/
– OWL http://www.w3.org/TR/owl2-overview/
• For describing
– RDF http://www.w3.org/RDF/
• For querying
– SPARQL http://www.w3.org/TR/2013/REC-sparql11-query-20130321/
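As a rough illustration of the describe/query split above: RDF models data as subject-predicate-object triples, and a SPARQL query matches triple patterns that contain variables. The sketch below imitates this in plain Python instead of a real RDF store. The "ex:" resource names and the match helper are made up for illustration.

```python
# Toy illustration only: a real system would use an RDF store and SPARQL.
# All "ex:" names below are hypothetical.
triples = {
    ("ex:field42", "rdf:type", "ex:Field"),
    ("ex:field42", "ex:hasCrop", "ex:Wheat"),
    ("ex:field7", "rdf:type", "ex:Field"),
    ("ex:field7", "ex:hasCrop", "ex:Maize"),
}


def match(pattern, data):
    """Return variable bindings for one triple pattern.

    Terms starting with "?" are variables; every other term must match exactly.
    """
    results = []
    for triple in data:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# Roughly: SELECT ?f WHERE { ?f ex:hasCrop ex:Wheat }
wheat_fields = match(("?f", "ex:hasCrop", "ex:Wheat"), triples)
```

Here wheat_fields holds a single binding of ?f to ex:field42. A real SPARQL engine would also join several patterns and apply filters.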
12. What SW can do
• Handle heterogeneity
• Handle evolution / variability
• Elicit inferred knowledge
• Volume is still the challenge
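One way to picture "inferred knowledge" is class-hierarchy reasoning in the style of RDFS: if a resource is typed as Wheat, and Wheat is a subclass of Cereal, then the resource is also a Cereal, even though nobody stated it. The sketch below is a simplification that assumes a single parent per class; the class names are hypothetical.

```python
# Hypothetical class hierarchy (child -> parent); names are illustrative only.
subclass_of = {
    "ex:Wheat": "ex:Cereal",
    "ex:Cereal": "ex:Crop",
    "ex:Crop": "ex:Plant",
}
# The only explicitly asserted fact about the sample.
asserted_types = {"ex:sample1": "ex:Wheat"}


def inferred_types(resource):
    """Follow subclass links upward and collect every class the resource belongs to."""
    types = []
    seen = set()
    current = asserted_types.get(resource)
    while current is not None and current not in seen:
        types.append(current)
        seen.add(current)  # guards against accidental cycles in the hierarchy
        current = subclass_of.get(current)
    return types


all_types = inferred_types("ex:sample1")
```

A query for all ex:Plant instances can then return ex:sample1, although only its ex:Wheat type was asserted.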
14. Moving Forward with “Old” Technologies
[Diagram: two architectures for agricultural data infrastructures]
NOW (2012) CASE OF AGRICULTURAL INFRASTRUCTURES, components shown:
• OAI-PMH Service Provider #1 … #n, with Schema #1 … Schema #n
• HARVESTER
• Aggregated XML Repository
• INDEXER components for AGRIS AP Schema, IEEE LOM Schema, DC Schema, ...
• Web Portals: Open AGRIS (FAO), AgLR/GLN (ARIADNE), Organic.Edunet (UAH), VOA3R (UAH), ...
2015 (AgINFRA) CASE OF AGRICULTURAL INFRASTRUCTURES, components shown:
• SPARQL endpoint (Data Source #1) … SPARQL endpoint (Data Source #n)
• Common Schema
• RDF Triple Store
• SPARQL endpoint
• Web Portals
Annotations on the diagram: "How Many? Is it feasible?" and "BigData Problem!"
15. What Semantic Web can bring into the picture
• One Data Access Point for the entire Data Cloud
– Enabling Service-Data level agreements with Data providers
• Application-level Vocabularies / Thesauri / Ontologies
– Enabling different application facets for different communities of users over the SAME data pool
• Going beyond existing Distributed Triple Store Implementations
– Link Heterogeneous but Semantically Connected Data
– Index Extremely Large Information Volumes (Peta Sizes)
– Improve Information Retrieval response
• Data (+Metadata) physically stored in Data Provider
– No need for harvesting
• Vocabularies / Thesauri / Ontologies of the Data Provider's choice
– No need for aligning according to common schemas
[Diagram labels: Client, SemaGrow SPARQL endpoint, Resource Discovery, Query Decomposition, Query Decomposer, Resource Selector, Query Pattern Discovery Service, Semantic Proximity, Data Source(s) Selector, Candidate Source(s) List, Instance Statistics, Load Info, Reactivity parameters, Data Summaries SPARQL endpoint, POWDER Inference Layer, P-Store, Query Manager, Query Transformation Service, Schema Mappings, Query Results Merger, Federated endpoint Wrapper, SPARQL endpoint (Data Source #1) … SPARQL endpoint (Data Source #n)]
16. The SemaGrow Solution
• Use POWDER to mass-annotate large subspaces
– Exploit naming convention regularities to compress
the indexes used by the system
• Partition triple patterns in the original query
• Annotate each fragment with an ordered list of
data sources most likely to contain relevant data
• Distribute and transform the query fragments
• Collect and align the results
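The first two steps above (partitioning the query and ranking sources per fragment) can be sketched as follows. This is not SemaGrow's actual implementation: it assumes that each triple pattern is its own fragment and that sources are ranked by a made-up per-predicate triple count. SOURCE_STATS, rank_sources and plan are hypothetical names.

```python
# Hypothetical statistics: number of triples each source holds per predicate.
SOURCE_STATS = {
    "source_a": {"ex:hasCrop": 900, "ex:yield": 50},
    "source_b": {"ex:hasCrop": 10, "ex:yield": 700},
    "source_c": {"ex:rainfall": 300},
}


def rank_sources(pattern):
    """Order sources by how many triples they hold for the pattern's predicate."""
    predicate = pattern[1]
    scored = [(stats.get(predicate, 0), name) for name, stats in SOURCE_STATS.items()]
    # Highest count first; sources with nothing for this predicate are dropped.
    return [name for count, name in sorted(scored, reverse=True) if count > 0]


def plan(query_patterns):
    """Treat each triple pattern as one fragment and attach its ranked source list."""
    return [(pattern, rank_sources(pattern)) for pattern in query_patterns]


query = [("?f", "ex:hasCrop", "?c"), ("?f", "ex:yield", "?y")]
query_plan = plan(query)
```

In this toy plan the ex:hasCrop fragment would be sent to source_a first, and the ex:yield fragment to source_b first. Distributing the fragments and aligning the results are left out of the sketch.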
17. The POWDER W3C Recommendation
• Exploits natural groupings of URIs to annotate all
resources in a subset of the URI space
• Regular expression based grouping
• Allows properties and their values to be
associated with an arbitrary number of subjects
within a fully-defined semantic framework
• POWDER Description Resources: http://www.w3.org/TR/powder-dr/
• POWDER Formal Semantics: http://www.w3.org/TR/powder-formal/
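The idea of annotating a whole group of URIs with one rule can be sketched with plain regular expressions. This is not POWDER syntax, which is an XML-based format defined in the documents linked above. The example.org URIs, the RULES list and the describe helper are assumptions made for illustration.

```python
import re

# Hypothetical rules: each regular expression over URIs carries properties
# that apply to every resource whose URI matches it.
RULES = [
    (re.compile(r"^http://example\.org/soil/"), {"ex:topic": "soil"}),
    (re.compile(r"^http://example\.org/weather/\d{4}/"), {"ex:topic": "weather"}),
]


def describe(uri):
    """Collect the properties of every rule whose pattern matches the URI."""
    description = {}
    for pattern, properties in RULES:
        if pattern.search(uri):
            description.update(properties)
    return description


soil = describe("http://example.org/soil/sample/17")
other = describe("http://example.org/crops/wheat")
```

Two rules are enough to describe any number of resources under the two URI prefixes, which is what makes this kind of grouping attractive for compressing indexes.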
18. The SemaGrow Stack
• Integrates the components in order to offer a single
SPARQL endpoint that federates a number of
heterogeneous data sources
• Targets the federation of independently provided
data sources
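A minimal sketch of what "a single endpoint over heterogeneous sources" means in practice: the caller asks in one common vocabulary, the request is rewritten per source, and the answers are merged back into the common form. The in-memory sources, record titles and mappings below are invented for illustration; only the dc: and lom: prefixes echo the DC and IEEE LOM schemas mentioned earlier.

```python
# Hypothetical in-memory "sources", each using its own property names.
SOURCES = {
    "source_a": [{"dc:title": "Soil survey 2012"}],
    "source_b": [{"lom:title": "Wheat yield report"}],
}
# Hypothetical mappings from the common property to each source's vocabulary.
MAPPINGS = {
    "source_a": {"title": "dc:title"},
    "source_b": {"title": "lom:title"},
}


def query_source(name, prop):
    """Rewrite the property into the source's vocabulary and fetch its values."""
    local_prop = MAPPINGS[name][prop]
    return [record[local_prop] for record in SOURCES[name] if local_prop in record]


def federated_query(prop):
    """Single entry point: ask every source, then merge into the common schema."""
    merged = []
    for name in sorted(SOURCES):
        for value in query_source(name, prop):
            merged.append({prop: value, "source": name})
    return merged


titles = federated_query("title")
```

The caller never sees dc:title or lom:title, and the data stays with each provider. The sketch queries every source, whereas a real federation layer would first select the relevant ones.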
20. Use Cases (DLO)
Heterogeneous Data Collections & Streams
Big data:
– Sensor data: soil data, weather
– GIS data: land usage, forest and natural resources management data
– Historical data: crop yield, economic data
– Forecasts: climate change models
Problem:
– Combine heterogeneous sources to analyze past food production and
forecast future trends
– Cannot clone and translate: large scale, live data streams
– Cannot immediately and directly effect a radical re-design of all sensing
and processing currently in place
3rd Plenary & ESG Meeting
21/10/2013
21. Use Cases (FAO)
Reactive Data Analysis
Big data:
– Document collections: past experiences, analysis and research results
– Databases: climate conditions and crop yield observations, economic
data (land and food prices)
Problem:
– Retrieving complete and accurate information to compile reports
• Raw data and reports, scientific publications, etc.
– Wastes human resources that could analyze data and synthesize useful
knowledge and advice for food production
• Too much time spent cross-relating responses from different sources
– Too many different organizations and processes rely on the different
schemas for re-design to be viable
– Cloning is inefficient: large and constantly updated stores
22. Use Cases (AK)
Reactive Resource Discovery
Big data:
– Multimedia content about agriculture and biodiversity
Problem:
– Real-time retrieval of relevant content
– Used to compile educational activities
– Schema heterogeneity:
• Different providers (Organic.Edunet, Europeana, VOA3R, etc.)
– Too many different organizations and processes rely on the different
schemas for re-design to be viable
– Cloning is inefficient: large and constantly updated stores
23. Project Info
• SemaGrow: Data intensive techniques to boost the real-time performance of global agricultural data infrastructures
• FP7-ICT-2011.4.4 (Intelligent Information Management)
No.  Name
1    Universidad de Alcala
2    NCSR “Demokritos”
3    Universita Degli Studi di Roma Tor Vergata
4    Semantic Web Company
5    Institut Za Fiziku
6    Stichting Dienst Landbouwkundig Onderzoek
7    Food and Agriculture Organization of the UN
8    Agroknow Technologies