Introduction to Big data

Open Data for Agriculture
Intro to Big Data
29/11/2013
Athens, Greece
Antonis Koukourikos
NCSR “Demokritos”

Presentation Outline
• What is Big Data?
• Semantic Web Technologies

• What Semantic Web brings into the picture


Big Data Is…

Data whose scale, diversity, and complexity
require new architecture, techniques, algorithms,
and analytics to manage it and extract value and
hidden knowledge from it


Big Data Sources
• Biomedical Information

• Sensor Data
• Logs
• E-mails
• Satellite images
• Audio and Video Streams
• Social Networks


Big Data Challenges – “The Three Vs”
• Volume
• Variety
• Velocity
…or is it 4…?
• Veracity
…or is it 6…?
• Visualization
• Value


Big Data demand…
• Storage
– Impractical or impossible to use centralized storage
• Distribution
• Federation
– Indexing is a problem in itself

• Computational power
– For discovering
– For searching / retrieving
– For joining

• Human effort and expertise
– Querying can become complex
– Are you sure you exploit all this information?

Part 2

SEMANTIC WEB TECHNOLOGIES

The Syntactic and the Semantic Web
• The World Wide Web represents information
using natural language, graphics, multimedia...
– Humans can process and combine this information easily
– However, machines are ignorant!

• The Semantic Web is a Web with meaning
– A web of data that machines can understand


Semantic Web Technologies
• Common formats for integration and combination of data
drawn from diverse sources, whereas the original Web
mainly concentrated on the interchange of documents.
• For defining
– RDFS http://www.w3.org/TR/rdf-schema/
– OWL http://www.w3.org/TR/owl2-overview/

• For describing
– RDF http://www.w3.org/RDF/

• For querying
– SPARQL http://www.w3.org/TR/2013/REC-sparql11-query-20130321/
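
To make the describe/query split concrete, here is a minimal sketch in plain Python rather than a real RDF library. Statements are (subject, predicate, object) triples, and a query is a list of triple patterns in which terms starting with "?" are variables, loosely mirroring a SPARQL basic graph pattern. The ex: identifiers and the helper names (match_pattern, query) are illustrative assumptions, not part of any W3C specification.

```python
# Toy RDF-style data: each statement is a (subject, predicate, object) triple.
# All ex: names are hypothetical examples.
TRIPLES = [
    ("ex:report1", "rdf:type", "ex:Report"),
    ("ex:report1", "ex:about", "ex:Wheat"),
    ("ex:report2", "rdf:type", "ex:Report"),
    ("ex:report2", "ex:about", "ex:Maize"),
    ("ex:video1", "ex:about", "ex:Wheat"),
]


def match_pattern(pattern, triple, binding):
    """Return `binding` extended so that `pattern` matches `triple`, or None."""
    result = dict(binding)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return result


def query(patterns, triples):
    """Evaluate a conjunction of triple patterns against a list of triples."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for b in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, b)) is not None
        ]
    return bindings


# Roughly: SELECT ?doc WHERE { ?doc rdf:type ex:Report . ?doc ex:about ex:Wheat }
results = query(
    [("?doc", "rdf:type", "ex:Report"), ("?doc", "ex:about", "ex:Wheat")],
    TRIPLES,
)
print(results)
```

The shared variable ?doc is what joins the two patterns, so only resources that are both a report and about wheat survive.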


What SW can do
• Handle heterogeneity
• Handle evolution / variability
• Elicit inferred knowledge

• Volume is still the challenge
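
As an illustration of "inferred knowledge", the sketch below applies a single RDFS-style rule (if X has type C and C is a subclass of D, then X has type D) until nothing new can be derived. It is a toy in plain Python under the assumption that the data fits in memory, which is exactly what stops holding once volume grows. The ex: names and the infer_types helper are hypothetical.

```python
def infer_types(triples):
    """Add every rdf:type triple implied by rdfs:subClassOf, to a fixpoint."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_of = {(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"}
        # Iterate over a snapshot so the set can grow inside the loop.
        for s, p, o in list(inferred):
            if p != "rdf:type":
                continue
            for sub, sup in subclass_of:
                new = (s, "rdf:type", sup)
                if sub == o and new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred


# Hypothetical data: only the most specific type is stated explicitly.
DATA = {
    ("ex:Wheat", "rdfs:subClassOf", "ex:Cereal"),
    ("ex:Cereal", "rdfs:subClassOf", "ex:Crop"),
    ("ex:field7_harvest", "rdf:type", "ex:Wheat"),
}

closure = infer_types(DATA)
print(("ex:field7_harvest", "rdf:type", "ex:Crop") in closure)
```

A query for all crops can now find the wheat harvest even though no source ever stated that fact directly.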


Part 3

WHAT SEMANTIC WEB BRINGS IN THE BIG
DATA PICTURE

Moving Forward with “Old” Technologies
[Diagram: two architectures for agricultural infrastructures]
• NOW (2012) CASE OF AGRICULTURAL INFRASTRUCTURES: OAI-PMH Service Providers #1 … #n (Schema #1 … #n), HARVESTER, Aggregated XML Repository, INDEXERs (AGRIS AP Schema, IEEE LOM Schema, DC Schema, ...), Web Portals: Open AGRIS (FAO), AgLR/GLN (ARIADNE), Organic.Edunet (UAH), VOA3R (UAH), ...
• 2015 (AgINFRA) CASE OF AGRICULTURAL INFRASTRUCTURES: SPARQL endpoints (Data Source #1 … #n), Common Schema, RDF Triple Store, SPARQL endpoint, Web Portals
• Callouts: “How Many? Is it feasible?” and “BigData Problem!”

What Semantic Web can bring into the picture
• One Data Access Point for the entire Data Cloud
– Enabling Service-Data level agreements with Data providers
• Application-level Vocabularies / Thesauri / Ontologies
– Enabling different application facets for different communities of users over the SAME data pool
• Going beyond existing Distributed Triple Store Implementations
– Link Heterogeneous but Semantically Connected Data
– Index Extremely Large Information Volumes (Peta Sizes)
– Improve Information Retrieval response
• Data (+Metadata) physically stored in Data Provider
– No need for harvesting
• Vocabularies / Thesauri / Ontologies of Data Provider choice
– No need for aligning according to common schemas

[Diagram: a Client querying the SemaGrow SPARQL endpoint, which federates the SPARQL endpoints of Data Sources #1 … #n through query decomposition, resource discovery, query transformation, and result merging]

The SemaGrow Solution
• Use POWDER to mass-annotate large subspaces
– Exploit naming convention regularities to compress
the indexes used by the system

• Partition triple patterns in the original query
• Annotate each fragment with an ordered list of
data sources most likely to contain relevant data
• Distribute and transform the query fragments
• Collect and align the results
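
The steps above can be sketched end to end in a few lines of Python. This is a toy under stated assumptions, not SemaGrow's implementation: the two in-memory sources stand in for remote SPARQL endpoints, per-source predicate counts stand in for the compressed index, every pattern is assumed to have a bound predicate, and the query transformation step is omitted because both toy sources share one vocabulary. All names (SOURCES, STATS, rank_sources, evaluate, federated_query) are hypothetical.

```python
from collections import Counter

# Hypothetical data sources; a real federation would call remote endpoints.
SOURCES = {
    "soil_sensors": [
        ("ex:field7", "ex:soilMoisture", "0.31"),
        ("ex:field9", "ex:soilMoisture", "0.12"),
    ],
    "yield_db": [
        ("ex:field7", "ex:cropYield", "4.2"),
        ("ex:field9", "ex:cropYield", "2.8"),
        ("ex:field7", "ex:soilMoisture", "0.30"),
    ],
}

# Per-source predicate counts play the role of the source index.
STATS = {name: Counter(p for _, p, _ in triples) for name, triples in SOURCES.items()}


def rank_sources(pattern):
    """Ordered list of sources most likely to hold data for one pattern."""
    predicate = pattern[1]
    scored = [(stats[predicate], name) for name, stats in STATS.items() if stats[predicate]]
    return [name for _, name in sorted(scored, reverse=True)]


def evaluate(pattern, triples):
    """Variable bindings for one triple pattern against one source."""
    rows = []
    for triple in triples:
        binding = {}
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if binding.setdefault(p, t) != t:
                    break
            elif p != t:
                break
        else:  # no mismatch found: the triple matches the pattern
            rows.append(binding)
    return rows


def federated_query(patterns):
    # Partition the query into patterns and annotate each with ranked sources.
    plan = [(pattern, rank_sources(pattern)) for pattern in patterns]
    results = [{}]
    for pattern, sources in plan:
        # Distribute: send the fragment to every candidate source.
        rows = [row for name in sources for row in evaluate(pattern, SOURCES[name])]
        # Collect and align: join on the variables shared with earlier fragments.
        results = [
            {**left, **right}
            for left in results
            for right in rows
            if all(left.get(k, v) == v for k, v in right.items())
        ]
    return plan, results


plan, results = federated_query(
    [("?field", "ex:soilMoisture", "?m"), ("?field", "ex:cropYield", "?y")]
)
for pattern, sources in plan:
    print(pattern, "->", sources)
print(results)
```

Note that the crop-yield fragment is only sent to yield_db, while the soil-moisture fragment goes to both sources, with the richer one first. Avoiding needless requests in this way is the point of annotating fragments before distributing them.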


The POWDER W3C Recommendation
• Exploits natural groupings of URIs to annotate all
resources in a subset of the URI space
• Regular expression based grouping

• Allows properties and their values to be
associated with an arbitrary number of subjects
within a fully-defined semantic framework
• POWDER Description Resources: http://www.w3.org/TR/powder-dr/
• POWDER Formal Semantics: http://www.w3.org/TR/powder-formal/
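
The idea of regular-expression grouping can be illustrated without POWDER's actual XML syntax. In the sketch below, each rule pairs a regular expression over URIs with the properties asserted for every matching resource, so three rules describe arbitrarily many resources. The example.org URIs, the ex: properties, and the describe helper are made-up assumptions for illustration only.

```python
import re

# Each rule: (regular expression over URIs, properties asserted for all matches).
# Patterns and properties are hypothetical examples.
DESCRIPTION_RESOURCES = [
    (re.compile(r"^http://example\.org/sensors/soil/"), {"ex:measures": "ex:SoilMoisture"}),
    (re.compile(r"^http://example\.org/sensors/"), {"ex:publisher": "ex:StationNetwork"}),
    (re.compile(r"^http://example\.org/reports/\d{4}/"), {"rdf:type": "ex:AnnualReport"}),
]


def describe(uri):
    """Collect the properties of every rule whose pattern matches the URI."""
    props = {}
    for regex, descriptors in DESCRIPTION_RESOURCES:
        if regex.search(uri):
            props.update(descriptors)
    return props


print(describe("http://example.org/sensors/soil/42"))
print(describe("http://example.org/reports/2012/summary"))
```

Because the annotations live on URI patterns rather than on individual resources, the index stays small no matter how many sensors or reports a provider publishes under those prefixes.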


The SemaGrow Stack
• Integrates the components in order to offer a single
SPARQL endpoint that federates a number of
heterogeneous data sources
• Targets the federation of independently provided
data sources


SemaGrow Architecture
[Diagram: SemaGrow architecture]
• Components: Client; SemaGrow SPARQL endpoint; Resource Discovery; Resource Selector; Query Pattern Discovery Service; Query Decomposition; Query Decomposer; Data Source(s) Selector; Data Summaries Endpoint; Data Summaries SPARQL endpoint; POWDER Inference Layer; P-Store; Query Manager; Query Transformation Service; Query Results Merger; Federated Endpoint Wrapper; SPARQL endpoints (Data Source #1 … #n)
• Labels on the connections: Query; query patterns; equivalent patterns; Semantic Proximity; Candidate Source(s) List; Instance Statistics; Load Info; Reactivity parameters; query fragment with target Source (#1 … #n); Schema Mappings; transformed query; transformed schema; query request #1 … #n; SPARQL query; query results; query results schema; Ctrl

Use Cases (DLO)

Heterogeneous Data Collections & Streams
• Big data:
– Sensor data: soil data, weather
– GIS data: land usage, forest and natural resources management data
– Historical data: crop yield, economic data
– Forecasts: climate change models

• Problem:
– Combine heterogeneous sources to analyze past food production and forecast future trends
– Cannot clone and translate: large scale, live data streams
– Cannot immediately and directly effect a radical re-design of all sensing and processing currently in place
3rd Plenary & ESG Meeting

21/10/2013

Use Cases (FAO)

Reactive Data Analysis
• Big data:
– Document collections: past experiences, analysis and research results
– Databases: climate conditions and crop yield observations, economic data (land and food prices)

• Problem:
– Retrieving complete and accurate information to compile reports
• Raw data and reports, scientific publications, etc.

– Wastes human resources that could analyze data and synthesize useful knowledge and advice for food production
• Too much time spent cross-relating responses from different sources

– Too many different organizations and processes rely on the different schemas to make re-design viable
– Cloning is inefficient: large and constantly updated stores


Use Cases (AK)

Reactive Resource Discovery
• Big data:
– Multimedia content about agriculture and biodiversity

• Problem:
– Real-time retrieval of relevant content
– Used to compile educational activities
– Schema heterogeneity:
• Different providers (Organic.Edunet, Europeana, VOA3R, etc.)

– Too many different organizations and processes rely on the different schemas to make re-design viable
– Cloning is inefficient: large and constantly updated stores


Project Info
• SemaGrow: Data intensive techniques to boost the real-time performance of global agricultural data infrastructures
• FP7-ICT-2011.4.4 (Intelligent Information Management)

No. | Name | Country
1 | Universidad de Alcala |
2 | |
3 | Universita Degli Studi di Roma Tor Vergata |
4 | Semantic Web Company |
5 | Institut Za Fiziku |
6 | Stichting Dienst Landbouwkundig Onderzoek |
7 | Food and Agriculture Organization of the UN |
8 | Agroknow Technologies |

Thank you!

Antonis Koukourikos
kukurik@iit.Demokritos.gr
