TOWARDS AN ARCHITECTURE FOR MANAGING BIG SEMANTIC DATA IN REAL-TIME
Carlos E. Cuesta, VorTIC3, URJC, Spain
Miguel A. Martínez-Prieto, UVa, Spain
Javier D. Fernández, UVa, Spain & UChile, Chile
ECSA 2013, Montpellier, France, 02/07/2013
CONTENTS
 Introduction
 Problem Statement
 Context: the RDF world
 Proposal: SOLID Architecture
 Unfolding in five Layers
 SOLID in Practice
 The RDF/HDT format
 The SOLID/HDT Architecture
 Conclusions & Future work
INTRODUCTION
 Big Data has become an important topic
 When the size of the data itself becomes part of the
problem (Loukides)
 Characterized by the “three Vs”
 Volume: large amounts of data gathered and stored
 The challenge is storage, but also computing
 Volume is relative: depends on available resources
 Velocity: different flows of data at different rates
 Variety: the kind of structures within the data
 Each source has its own semantics
 Need for a logical model to allow data integration
 An architecture for Big Data must consider all three dimensions
INTRODUCTION
 One of the dimensions always becomes critical
 E.g. storage in mobile applications, velocity in real-time
applications (vs. batch processes)
 We promote variety
 The dataset value is increased when multiple sources
are integrated, achieving more knowledge
 This also influences velocity and volume
 We choose a graph-based model
 Allows managing higher levels of variety
 Data can be linked and queried together
 In practice, this means using RDF as data model
 The cornerstone of the “practical” Semantic Web
 The basis of the emergent Web of Data
PROBLEM STATEMENT
 Most solutions to manage Big Data intend to
maximize the volume dimension
 Therefore promoting efficient storage
 Datastores able to cope with large datasets
 Indexing strategies to achieve high space
 Datastores must be assumed to be stable
 In spite of the assumed immutability property
 But, the volume of incoming data is also big
 Datastores must be periodically updated & reindexed
 This is very complex in a Real-Time context
 Data must be received and integrated in real time
 No time to process the flow of incoming data
OUR PROPOSAL: SOLID ARCHITECTURE
 We propose a specific architecture to manage
Real-Time flows in this context
 A multi-tiered architecture
 Separate consumption of Big Semantic Data…
 … from the complexities of Real-Time operation
 Data must be kept compact
 It is stored and indexed in a compressed way
 Data & Index Layers
 Needs to efficiently cope with data updates
 The reason for the Online Layer
 Needs to query all of this together
 The reason for the Service Layer
CONTEXT: RDF
 RDF: Resource Description Framework
 Data described as (subject, predicate, object) triples
 An RDF dataset is a graph of knowledge
 Entities linked to values via labelled edges
 Essential for Linked Open Data
 Adopted in many different contexts
 Simple integration: everything has a URI
[Figure: example triple (John, owns, Car)]
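A minimal sketch of the triple model, assuming plain Python strings stand in for URIs. The extra triples and the helper names `build_graph` and `match` are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical example data; only (John, owns, Car) comes from the slide.
triples = [
    ("John", "owns", "Car"),
    ("John", "livesIn", "Madrid"),
    ("Car", "color", "red"),
]


def build_graph(triples):
    """Index triples as subject -> list of (predicate, object) labelled edges."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


graph = build_graph(triples)
owned = match(triples, s="John", p="owns")
print(owned)
```

The wildcard pattern in `match` mirrors, in a very reduced form, how a triple pattern is resolved against an RDF graph.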
CONTEXT: RDF
 The origin of the Web of Data
 Two datasets can become connected by a single triple
<Station #123, location, Canal Street>
 The web becomes data-centric
 Every unit is a small piece of data
 “The Big Data’s long tail”
 But their integration in large contexts becomes
complex: Big Semantic Data
 A variety of sources become easily integrated
 RDF is not a serialization format
 Describes what data is stored, not how this is done
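The claim that a single triple can connect two datasets can be sketched as follows. Only the linking triple comes from the slide; the two tiny datasets, their `type` triples and the `reachable` helper are assumptions made for the example.

```python
# Two hypothetical, independently published datasets.
stations = {("Station #123", "type", "BikeStation")}
streets = {("Canal Street", "type", "Street")}

# The single linking triple from the slide.
link = ("Station #123", "location", "Canal Street")


def reachable(graph, start):
    """Follow edges from start and return every node that can be reached."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(o for (s, _, o) in graph if s == node)
    return seen


unlinked = stations | streets
merged = stations | streets | {link}

print("Street" in reachable(unlinked, "Station #123"))  # False
print("Street" in reachable(merged, "Station #123"))    # True
```

Without the link, a traversal starting at the station never leaves its own dataset; with it, both datasets can be queried as one graph.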
SOLID ARCHITECTURE
[Figure: SOLID architecture diagram. Labels: Online Layer, New Data, Dump, Merge Layer (Batch), Parallelizable Processing, Data Layer, Big Data, DataStore, Rd, Index Layer, Service Layer, Query, Join]
SOLID ARCHITECTURE
[Figure: SOLID architecture diagram as on the previous slide, annotated with RDF and SPARQL]
SOLID ARCHITECTURE
 Online Layer
 Receives incoming new data
 Deals with real-time needs
 Data Layer
 The core of the architecture
 The main datastore: the Big Data repository
 Stores compressed RDF
 Index Layer
 Provides an index for the Data Layer, to enable
high-speed access
 Most accesses to the repository are made through it
SOLID ARCHITECTURE
 Service Layer
 The façade to the external user
 Able to issue federated SPARQL queries against the
separate datastores in different layers
 Every query is distributed, and the different answers
are joined
 Merge Layer
 Makes it possible to integrate the two datastores
 Receives a data dump from the Online Layer
 Integrates that with the existing repository
 Producing a fresh copy of the Data Layer
 Immutability properties are preserved
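How the five layers cooperate can be sketched with a toy in-memory model. This is an illustration under strong assumptions, not the authors' implementation: the class `SolidSketch`, its method names and the predicate-only index are all hypothetical, and Python sets stand in for the real datastores.

```python
from collections import defaultdict


class SolidSketch:
    """Toy model of the five layers; every name here is hypothetical."""

    def __init__(self, triples=()):
        self.data = frozenset(triples)              # Data Layer: immutable store
        self.index = self._build_index(self.data)   # Index Layer
        self.online = set()                         # Online Layer: mutable buffer

    @staticmethod
    def _build_index(data):
        """Group triples by predicate so lookups avoid scanning the store."""
        by_predicate = defaultdict(set)
        for triple in data:
            by_predicate[triple[1]].add(triple)
        return dict(by_predicate)

    def insert(self, triple):
        """Online Layer: accept new data without touching the big store."""
        self.online.add(triple)

    def query(self, predicate):
        """Service Layer: ask both stores, then join the answers."""
        from_index = self.index.get(predicate, set())
        from_online = {t for t in self.online if t[1] == predicate}
        return from_index | from_online

    def merge(self):
        """Merge Layer: build a fresh Data Layer copy and swap it in."""
        fresh = frozenset(self.data | self.online)
        self.data = fresh
        self.index = self._build_index(fresh)
        self.online = set()


store = SolidSketch([("John", "owns", "Car")])
store.insert(("Mary", "owns", "Bike"))
before = store.query("owns")   # answered from both stores
store.merge()
after = store.query("owns")    # answered from the fresh Data Layer only
print(before == after)
```

The existing `frozenset` is never modified in place: `merge` builds a new one and replaces the reference, which is one simple way to read "immutability properties are preserved".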
SOLID IN PRACTICE
 This abstract architecture becomes feasible by
applying existing technology
 In particular, the RDF/HDT binary format
 Decisions must be taken, layer by layer, about
how to actually implement it
 Other alternatives would also be possible (and some
of them are also being implemented)
 Data-Centric Layers
 Do not use a textual RDF representation
 Inefficient, prevents some potential uses
 RDF/HDT is a binary format
 Conceived specifically for serialization purposes
SOLID IN PRACTICE
 RDF/HDT format
 Designed for machine processing
 About 15 times less space than equivalent formats
 Uses compact (compressed) data structures
 Data Layer
 Big Semantic Data in RDF/HDT
 Data saving and guaranteed immutability
 Instant mapping to memory
 Allows querying without decompressing
 Index Layer
 Implements the HDT/FoQ proposal
 Lightweight index on top of the HDT binary format
 Efficient SPARQL retrieval without decompressing
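The idea of storing triples compactly while still answering queries on the compact form can be illustrated with dictionary encoding. This is only a rough sketch: the real RDF/HDT layout is considerably more elaborate, and the function names and example data below are assumptions.

```python
from bisect import bisect_left


def encode(triples):
    """Map each distinct term to an integer ID; rewrite triples as sorted ID tuples."""
    terms = sorted({term for triple in triples for term in triple})
    term_to_id = {term: i for i, term in enumerate(terms, start=1)}
    encoded = sorted(tuple(term_to_id[t] for t in triple) for triple in triples)
    return terms, encoded


def decode(terms, encoded):
    """Turn ID tuples back into term tuples."""
    return [tuple(terms[i - 1] for i in ids) for ids in encoded]


def match_subject(terms, encoded, subject):
    """Find all triples for a subject by binary search on the encoded form."""
    pos = bisect_left(terms, subject)
    if pos == len(terms) or terms[pos] != subject:
        return []
    sid = pos + 1
    lo = bisect_left(encoded, (sid,))       # first tuple with this subject ID
    hi = bisect_left(encoded, (sid + 1,))   # first tuple with the next subject ID
    return decode(terms, encoded[lo:hi])


# Hypothetical example data.
triples = [
    ("John", "owns", "Car"),
    ("John", "livesIn", "Madrid"),
    ("Car", "color", "red"),
]
terms, encoded = encode(triples)
johns = match_subject(terms, encoded, "John")
print(johns)
```

Each term is stored once and the triples become small integer tuples; the lookup works directly on the sorted IDs and decodes only the matching rows.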
SOLID IN PRACTICE
 Online Layer
 Copes with the incoming flow of real-time data
 HDT is inadequate (designed for read-only)
 Must resolve SPARQL efficiently
 Choose a general-purpose NoSQL technology
 Still able to dump data in an RDF format
 Service Layer
 Resolves any potential queries
 SPARQL considered expressive enough
 Queries are forwarded to Online and Index Layers
 Their results are retrieved and combined
 Using a (scalable) Pipe-Filter approach
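A pipe-and-filter combination of the two result streams might look like the following sketch. Python generators stand in for pipes, a single predicate stands in for a SPARQL query, and all function names and data are hypothetical.

```python
from itertools import chain


def source(store):
    """Pump: emit every triple held by one layer's store."""
    yield from store


def pattern_filter(stream, predicate):
    """Filter: keep only the triples with the requested predicate."""
    for triple in stream:
        if triple[1] == predicate:
            yield triple


def dedupe(stream):
    """Filter: drop triples already seen, preserving arrival order."""
    seen = set()
    for item in stream:
        if item not in seen:
            seen.add(item)
            yield item


def service_query(index_store, online_store, predicate):
    """Forward the query to both layers and combine their results."""
    branches = (
        pattern_filter(source(store), predicate)
        for store in (index_store, online_store)
    )
    return list(dedupe(chain.from_iterable(branches)))


# Hypothetical contents of the two layers.
index_store = [("John", "owns", "Car"), ("Car", "color", "red")]
online_store = [("Mary", "owns", "Bike"), ("John", "owns", "Car")]
result = service_query(index_store, online_store, "owns")
print(result)
```

Because every stage consumes and produces a stream, further filters can be added, or branches run in parallel, without changing the other stages.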
SOLID IN PRACTICE
 Merge Layer
 Able to combine incoming data from the Online Layer
with the existing datastore in the Data Layer
 The data dump is merged into a copy of the datastore
 Then the fresh datastore replaces the previous one
 Periodic process, can also be manually triggered
 Requires high-performance computation
 In practice, this means a Map/Reduce approach
 Raw RDF data from Online Layer is converted
 Then ordered for internal merging
 Depends on the size of the smaller store
 Also triggers reindexing the Index Layer
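The order-then-merge step can be sketched on a single machine as below. This is a stand-in for the Map/Reduce job, not a distributed implementation; `merge_layer` and the example data are hypothetical, and the existing datastore is assumed to be already sorted.

```python
import heapq


def merge_layer(datastore, online_dump):
    """Return a fresh sorted datastore; the inputs are left untouched."""
    ordered_dump = sorted(set(online_dump))     # order the (smaller) dump
    fresh = []
    # heapq.merge walks both sorted inputs once, in order.
    for triple in heapq.merge(datastore, ordered_dump):
        if not fresh or fresh[-1] != triple:    # skip duplicates
            fresh.append(triple)
    return tuple(fresh)


# Existing Data Layer contents (sorted) and a dump from the Online Layer.
datastore = (("Car", "color", "red"), ("John", "owns", "Car"))
online_dump = [("Mary", "owns", "Bike"), ("John", "owns", "Car")]

fresh_store = merge_layer(datastore, online_dump)
print(fresh_store)
```

Only the dump has to be sorted, so the sorting cost depends on the smaller store, while the merge itself is a single linear pass that produces a new copy to replace the old one.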
SOLID ARCHITECTURE IN PRACTICE
[Figure: SOLID architecture diagram with concrete technologies. Labels: Online Layer, New Data, Dump, NoSQL, Merge Layer (Batch), Hadoop, Data Layer, RDF/HDT, Semantic Data, Rd, Index Layer, Service Layer, SPARQL, SPARQL + P/F]
CONCLUSIONS & FUTURE WORK
 We propose SOLID as a generic architecture for
managing Big Semantic Data
 Our particular implementation relies on HDT
 Also NoSQL for real-time incoming data
 Cassandra, but (still) not the only choice
 Map/Reduce (Hadoop) for intensive processing
 Highly effective in terms of space & time
 Initial empirical results are very significant
 Currently developing an optimized prototype
 Already working on variants of the architecture
 Limited version for mobile devices
 The Merge Layer is not directly required
THANKS FOR YOUR ATTENTION
Introduction to Nonprofit Accounting: The Basics
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 

ECSA 2013 (Cuesta)

  • 1. TOWARDS AN ARCHITECTURE FOR MANAGING BIG SEMANTIC DATA IN REAL-TIME Carlos E. Cuesta, VorTIC3, URJC, Spain Miguel A. Martínez-Prieto, UVa, Spain Javier D. Fernández, UVa, Spain & UChile, Chile Montpellier, France, 02/07/2013
  • 2. CONTENTS ▪ Introduction ▪ Problem Statement ▪ Context: the RDF world ▪ Proposal: SOLID Architecture ▪ Unfolding in five Layers ▪ SOLID in Practice ▪ The RDF/HDT format ▪ The SOLID/HDT Architecture ▪ Conclusions & Future work
  • 3. INTRODUCTION ▪ Big Data has become an important topic ▪ When the size of the data itself becomes part of the problem (Loukides) ▪ Characterized by the “three Vs” ▪ Volume: large amounts of data gathered and stored ▪ The challenge is storage, but also computing ▪ Volume is relative: depends on available resources ▪ Velocity: different flows of data at different rates ▪ Variety: the kind of structures within the data ▪ Each source has its own semantics ▪ Need for a logical model to allow data integration ▪ An architecture for Big Data must consider all of these
  • 4. INTRODUCTION ▪ One of the dimensions always becomes critical ▪ E.g. storage in mobile applications, velocity in real-time applications (vs. batch processes) ▪ We promote variety ▪ The dataset value is increased when multiple sources are integrated, achieving more knowledge ▪ This also influences velocity and volume ▪ We choose a graph-based model ▪ Allows managing higher levels of variety ▪ Data can be linked and queried together ▪ In practice, this means using RDF as the data model ▪ The cornerstone of the “practical” Semantic Web ▪ The basis of the emerging Web of Data
  • 5. PROBLEM STATEMENT ▪ Most solutions to manage Big Data intend to maximize the volume dimension ▪ Therefore promoting efficient storage ▪ Datastores able to cope with large datasets ▪ Indexing strategies to achieve high space ▪ Datastores must be assumed to be stable ▪ In spite of the assumed immutability property ▪ But the volume of incoming data is also big ▪ Datastores must be periodically updated & reindexed ▪ This is very complex in a Real-Time context ▪ Data must be received and integrated in real time ▪ No time to process the flow of incoming data
  • 6. OUR PROPOSAL: SOLID ARCHITECTURE ▪ We propose a specific architecture to manage Real-Time flows in this context ▪ A multi-tiered architecture ▪ Separate consumption of Big Semantic Data… ▪ … from the complexities of Real-Time operation ▪ Data must be kept compact ▪ It is stored and indexed in a compressed way ▪ Data & Index Layers ▪ Needs to efficiently cope with data updates ▪ The reason for the Online Layer ▪ Needs to query all of this together ▪ The reason for the Service Layer
  • 7. CONTEXT: RDF ▪ RDF: Resource Description Framework ▪ Data described as (subject, predicate, object) triples ▪ An RDF dataset is a graph of knowledge ▪ Entities linked to values via labelled edges ▪ Essential for Linked Open Data ▪ Adopted in many different contexts ▪ Simple integration: everything has a URI ▪ [Figure: example triple, John owns Car]
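The triple model on this slide can be illustrated with a minimal Python sketch. It is not an RDF library: terms are plain strings rather than URIs, the `Car`/`color` triple is invented for illustration, and `build_graph` and `match` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical example data: each triple is (subject, predicate, object).
# Real RDF terms would be URIs or literals; plain strings keep the sketch short.
triples = [
    ("John", "owns", "Car"),
    ("Car", "color", "red"),
]

def build_graph(triples):
    """Index triples as subject -> list of (predicate, object) labelled edges."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return graph

def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

graph = build_graph(triples)
print(graph["John"])
print(match(triples, p="owns"))
```

The wildcard pattern in `match` mirrors, in a very reduced form, the triple patterns that SPARQL queries are built from.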
  • 8. CONTEXT: RDF ▪ The origin of the Web of Data ▪ Two datasets can become connected by a single triple <Station #123, location, Canal Street> ▪ The web becomes data-centric ▪ Every unit is a small piece of data ▪ “The Big Data’s long tail” ▪ But their integration in large contexts becomes complex: Big Semantic Data ▪ A variety of sources become easily integrated ▪ RDF is not a serialization format ▪ Describes what data is stored, not how this is done
  • 9. SOLID ARCHITECTURE ▪ [Figure: SOLID architecture diagram with Online Layer, Data Layer, Index Layer, Merge Layer (batch) and Service Layer; other labels: New Data, Dump, Rd, DataStore, Big Data, Query, Join, Parallelizable Processing]
  • 10. SOLID ARCHITECTURE ▪ [Figure: the same SOLID architecture diagram, with the additional labels RDF and SPARQL]
  • 11. SOLID ARCHITECTURE ▪ Online Layer ▪ Receives incoming new data ▪ Deals with real-time needs ▪ Data Layer ▪ The core of the architecture ▪ The main datastore: the Big Data repository ▪ Stores compressed RDF ▪ Index Layer ▪ Provides an index for the Data Layer, to enable high-speed access ▪ Most accesses to the repository are made through it
  • 12. SOLID ARCHITECTURE ▪ Service Layer ▪ The façade to the external user ▪ Able to issue federated SPARQL queries to the separate datastores in different layers ▪ Every query is distributed, and the different answers are joined ▪ Merge Layer ▪ Makes it possible to integrate the two datastores ▪ Receives a dump of data from the Online Layer ▪ Integrates that with the existing repository ▪ Producing a fresh copy of the Data Layer ▪ Immutability properties are preserved
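The Service Layer behaviour described on this slide (distribute every query, then join the answers) can be sketched as follows. This is a toy model under stated assumptions: both stores are in-memory Python sets, a query is a single triple pattern rather than SPARQL, "joining" is reduced to a set union, and `DATA_LAYER`, `online_layer`, `query_store` and `service_query` are hypothetical names.

```python
# Hypothetical stand-ins for the two datastores: the Data Layer is immutable
# (frozenset), while the Online Layer is mutable and receives new triples.
DATA_LAYER = frozenset({
    ("Station#123", "location", "Canal Street"),
    ("Station#123", "reading", "17"),
})
online_layer = {("Station#123", "reading", "18")}

def query_store(store, s=None, p=None, o=None):
    """Match one triple pattern against one store; None is a wildcard."""
    return {
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }

def service_query(s=None, p=None, o=None):
    """Forward the pattern to both stores and combine (here: union) the answers."""
    historical = query_store(DATA_LAYER, s, p, o)
    recent = query_store(online_layer, s, p, o)
    return sorted(historical | recent)

readings = service_query(p="reading")
print(readings)
```

The caller sees one answer set and never needs to know which layer a triple came from, which is the point of a façade.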
  • 13. SOLID IN PRACTICE ▪ This abstract architecture is made possible by applying existing technology ▪ In particular, the RDF/HDT binary format ▪ Decisions must be made, layer by layer, about how to actually implement it ▪ Other alternatives would also be possible (and some of them are also being implemented) ▪ Data-Centric Layers ▪ Do not use a textual RDF representation ▪ Inefficient, prevents some potential uses ▪ RDF/HDT is a binary format ▪ Conceived specifically for serialization purposes
  • 14. SOLID IN PRACTICE ▪ RDF/HDT format ▪ Designed for machine processing ▪ About 15 times less space than equivalent formats ▪ Uses compact (compressed) data structures ▪ Data Layer ▪ Big Semantic Data in RDF/HDT ▪ Data saving and guaranteed immutability ▪ Instant mapping to memory ▪ Allows querying without decompressing ▪ Index Layer ▪ Implements the HDT/FoQ proposal ▪ Lightweight index on top of the HDT binary format ▪ Efficient SPARQL retrieval without decompressing
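One ingredient of compact RDF representations such as HDT is dictionary encoding: each distinct term is stored once and triples are kept as small integer IDs. The sketch below illustrates only that general idea. It is not the actual HDT layout (which organises its dictionary and triples in more elaborate compressed structures), and `encode`, `decode` and the sample data are hypothetical.

```python
def encode(triples):
    """Replace every term by an integer ID; return (dictionary, ID triples)."""
    # Each distinct term is stored once; its ID is its 1-based position
    # in the sorted term list.
    terms = sorted({term for triple in triples for term in triple})
    term_to_id = {term: i for i, term in enumerate(terms, start=1)}
    id_triples = sorted(
        tuple(term_to_id[term] for term in triple) for triple in triples
    )
    return terms, id_triples

def decode(terms, id_triples):
    """Rebuild the original string triples from the dictionary and the IDs."""
    return [tuple(terms[i - 1] for i in triple) for triple in id_triples]

# Hypothetical sample data.
triples = [
    ("John", "owns", "Car"),
    ("John", "knows", "Mary"),
    ("Mary", "owns", "Car"),
]
terms, id_triples = encode(triples)
print(terms)
print(id_triples)
```

Repeated terms cost one integer each instead of a full string, and sorted ID triples lend themselves to further compression; a pattern can be answered by translating its terms to IDs first, which is one way querying without decompressing can work.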
  • 15. SOLID IN PRACTICE ▪ Online Layer ▪ Copes with the incoming flow of real-time data ▪ HDT is inadequate (designed for read-only) ▪ Must resolve SPARQL efficiently ▪ Choose a general-purpose NoSQL technology ▪ Still able to dump data in an RDF format ▪ Service Layer ▪ Resolves any potential queries ▪ SPARQL considered expressive enough ▪ Queries are forwarded to Online and Index Layers ▪ Their results are retrieved and combined ▪ Using a (scalable) Pipe-Filter approach
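A Pipe-Filter combination of results from two layers can be sketched with Python generators, where each filter consumes one stream and yields another. This is only an illustration of the pattern under assumed inputs: the two result lists and the filter names (`source`, `distinct`, `project_objects`) are hypothetical, and nothing here is distributed.

```python
from itertools import chain

def source(store, predicate):
    """Filter 1: emit the triples from one store that use the given predicate."""
    for triple in store:
        if triple[1] == predicate:
            yield triple

def distinct(stream):
    """Filter 2: drop triples that already appeared earlier in the stream."""
    seen = set()
    for triple in stream:
        if triple not in seen:
            seen.add(triple)
            yield triple

def project_objects(stream):
    """Filter 3: keep only the object of each triple."""
    for _, _, obj in stream:
        yield obj

# Hypothetical partial results coming back from the Index and Online Layers.
index_results = [("s1", "temp", "20"), ("s2", "temp", "21")]
online_results = [("s2", "temp", "21"), ("s3", "temp", "19")]

# The pipes: both sources are chained, deduplicated, then projected.
pipeline = project_objects(
    distinct(chain(source(index_results, "temp"), source(online_results, "temp")))
)
values = list(pipeline)
print(values)
```

Because every stage is lazy, results flow through one triple at a time, and stages can be added or swapped without touching the others.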
  • 16. SOLID IN PRACTICE ▪ Merge Layer ▪ Able to combine incoming data from the Online Layer with the existing datastore in the Data Layer ▪ The data dump is merged into a copy of the datastore ▪ Then the fresh datastore replaces the previous one ▪ Periodic process, can also be manually triggered ▪ Requires high-performance computation ▪ In practice, this means a Map/Reduce approach ▪ Raw RDF data from Online Layer is converted ▪ Then ordered for internal merging ▪ Depends on the size of the smaller store ▪ Also triggers reindexing the Index Layer
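The merge step (order the dump, merge it with the existing store, produce a fresh copy) can be sketched on a single machine as below. The slides describe a Map/Reduce implementation; this substitute uses the standard-library `heapq.merge` instead, assumes the existing store is already sorted, and `merge_layer` is a hypothetical name.

```python
import heapq

def merge_layer(data_store, online_dump):
    """Return a fresh, sorted, duplicate-free store; the inputs are not modified.

    Assumes data_store is an already sorted tuple of triples.
    """
    # Only the (smaller) dump needs sorting before the merge.
    ordered_dump = sorted(set(online_dump))
    merged = []
    # heapq.merge lazily interleaves two sorted inputs in sorted order.
    for triple in heapq.merge(data_store, ordered_dump):
        if not merged or merged[-1] != triple:  # skip duplicates
            merged.append(triple)
    return tuple(merged)

# Hypothetical data: the existing store and a dump from the Online Layer.
old_store = (("a", "p", "b"), ("c", "p", "d"))
dump = [("b", "p", "c"), ("a", "p", "b")]

fresh_store = merge_layer(old_store, dump)
print(fresh_store)
```

Since the result is a new tuple, the old store stays readable until the fresh one replaces it, which is how the immutability property can be preserved across merges.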
  • 17. SOLID ARCHITECTURE IN PRACTICE ▪ [Figure: SOLID architecture diagram annotated with technologies: Online Layer (NoSQL), Data Layer (RDF/HDT), Index Layer, Merge Layer (batch, Hadoop), Service Layer (SPARQL + P/F); other labels: New Data, Dump, Rd, SPARQL, Semantic Data]
  • 18. CONCLUSIONS & FUTURE WORK ▪ We propose SOLID as a generic architecture for managing Big Semantic Data ▪ Our particular implementation relies on HDT ▪ Also NoSQL for real-time incoming data ▪ Cassandra, but (still) not the only choice ▪ Map/Reduce (Hadoop) for intensive processing ▪ Highly effective in terms of space & time ▪ Initial empirical results are very significant ▪ Currently developing an optimized prototype ▪ Already working on variants of the architecture ▪ Limited version for mobile devices ▪ The Merge Layer is not directly required
  • 19. THANKS FOR YOUR ATTENTION