               Dremel: Interactive Analysis of
                     Web-Scale Datasets
                                    Paper Review

                                  Arinto Murdopo

                                 November 8, 2012


1 Motivation
The main motivation of the paper is the inability of existing big data infrastructure
(MapReduce with BigTable, and Hadoop) to perform fast ad-hoc explorations and queries
over web-scale datasets. In this context, ad-hoc queries are on-the-fly queries that users
issue whenever they need them. Ad-hoc queries are expected to execute fast enough that
users can explore the datasets interactively.
  A secondary motivation of the paper is the need for a more user-friendly query mechanism
that makes data analytics on big data infrastructure easier and faster. Pig and Hive
address this challenge by providing SQL-like query languages on Hadoop, but they solve
only the usability issue and not the performance issue.
  Solving these issues allows faster, more efficient and more effective data analytics on
big data infrastructure, which implies higher productivity for the data scientists and
engineers who work in big data analytics. Hence, these issues are interesting and
important to solve.


2 Contributions
The main contribution of Dremel is a high-performance technique for processing web-scale
datasets for ad-hoc queries. The technique addresses the following challenges:

  1. Inefficiency in storing data for ad-hoc queries.
     Most ad-hoc queries do not need all of the fields/columns available in a table.
     The authors therefore propose a columnar data model that improves data retrieval
     performance. It was a novel solution because well-known big data processing
     platforms at that time (such as MapReduce on Hadoop) worked on a record-oriented
     data model.

  2. Performance and scalability challenges in processing the data.
     A multi-level serving tree allows high parallelism in processing the data. It
     also allows the query to be optimized at each level of the tree. Query scheduling
     allows executions to be prioritized. Both techniques address the performance
     challenge in data processing.
     The scalability challenge is also addressed, since more nodes can easily be added
     to the multi-level serving tree to process more data. Separating the query
     scheduler from the root server is a good design because it decouples query
     scheduling from job tracking (note that job tracking is performed by the root
     server). This scheduler model implies higher scalability when the number of leaf
     servers is huge.

  A minor contribution of this paper is the use of actual Google datasets in the
experiments. This gives other researchers insight into the practical magnitude of
web-scale datasets.


3 Solutions
3.1 Columnar Data Model
As mentioned before, ad-hoc queries need only a small subset of the fields/columns in a
table. A record-oriented data model introduces significant inefficiencies, because
retrieving the needed fields/columns requires reading the whole record, including the
unneeded fields/columns. The columnar data model is introduced to reduce these
inefficiencies: it allows us to read only the fields/columns that an ad-hoc query needs,
which reduces the amount of data retrieved and increases query speed. The authors also
explain how to convert from the record-oriented data model to the columnar model and
vice versa.
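
The difference between the two layouts can be illustrated with a minimal Python sketch.
It is not Dremel's storage format: it assumes flat records that all share the same
fields, whereas Dremel handles nested records and needs extra bookkeeping (repetition
and definition levels) that is omitted here. The sample records and all function names
are hypothetical. The sketch converts records to columns and back, and counts how many
stored values a single-field aggregation has to touch in each layout.

```python
# Hypothetical sample data: flat records that all share the same fields.
records = [
    {"url": "a.example", "country": "us", "bytes": 120, "status": 200},
    {"url": "b.example", "country": "de", "bytes": 300, "status": 404},
    {"url": "c.example", "country": "us", "bytes": 80, "status": 200},
]


def to_columns(rows):
    """Convert a list of flat records into one list of values per field."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def to_records(columns):
    """Rebuild the records from the columns (inverse of to_columns)."""
    fields = list(columns)
    return [dict(zip(fields, values)) for values in zip(*columns.values())]


def sum_field_row_store(rows, field):
    """Sum one field in the record layout; every value of every record is read."""
    total, touched = 0, 0
    for row in rows:
        touched += len(row)  # the whole record has to be read
        total += row[field]
    return total, touched


def sum_field_column_store(columns, field):
    """Sum one field in the columnar layout; only that column is read."""
    values = columns[field]
    return sum(values), len(values)


columns = to_columns(records)
row_total, row_touched = sum_field_row_store(records, "bytes")
col_total, col_touched = sum_field_column_store(columns, "bytes")
```

Both layouts give the same sum, but the record layout touches every value of every
record while the columnar layout touches only the values of the requested column.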

3.2 Multi-level Serving Tree
The authors use a multi-level serving tree to process the columnar data. A typical
multi-level serving tree consists of a root server, several intermediate servers and many
leaf servers. There are two reasons behind the multi-level serving tree:

  1. The result sets of ad-hoc queries are typically small or medium-sized, so the
     overhead of processing this kind of data in parallel is small.

  2. It provides a high degree of parallelism for processing small or medium-sized data.
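
The aggregation pattern of a serving tree can be sketched in Python as follows. This is
an illustrative model only, not the paper's implementation: the function names, the
`fanout` parameter and the (count, sum) partial aggregate are assumptions, and the leaf
scans run sequentially here although real leaf servers would run in parallel.

```python
def leaf_partial(tablet, predicate):
    """Leaf server: scan one tablet and return a partial aggregate (count, sum)."""
    matching = [value for value in tablet if predicate(value)]
    return len(matching), sum(matching)


def merge(partials):
    """Intermediate or root server: combine the partial aggregates of its children."""
    count = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return count, total


def execute(tablets, predicate, fanout=2):
    """Run the query bottom-up through a tree in which each server has `fanout` children."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf_partial(tablet, predicate) for tablet in tablets]
    if not level:
        return 0, 0
    # Each pass models one level of intermediate servers, up to the root.
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0]


tablets = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
count, total = execute(tablets, lambda value: value % 2 == 0, fanout=2)
```

Because each server returns only a small partial aggregate, the data passed up the tree
stays small, which is consistent with the first reason given above.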

3.3 Query Dispatcher
The authors propose a mechanism to regulate resource allocation for each leaf server.
The mechanism is handled by a module called the query dispatcher. The query dispatcher
works in units of slots, where the number of slots is the number of processing units
available for execution. It allocates an appropriate number of tablets to each slot. It
deals with stragglers by moving tablets from slow straggler slots to new slots.
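
A simplified model of this behaviour is sketched below in Python. The round-robin
assignment, the least-loaded placement rule and the function names are assumptions made
for illustration, and the sketch takes the set of slow slots as given instead of
modelling how stragglers are detected.

```python
def dispatch(tablets, num_slots):
    """Assign tablets to slots round-robin; returns {slot: [tablets]}."""
    assignment = {slot: [] for slot in range(num_slots)}
    for index, tablet in enumerate(tablets):
        assignment[index % num_slots].append(tablet)
    return assignment


def redispatch(assignment, slow_slots):
    """Move the tablets of slow (straggler) slots to the least-loaded healthy slots."""
    healthy = [slot for slot in assignment if slot not in slow_slots]
    if not healthy:
        raise ValueError("no healthy slot available")
    result = {slot: list(assignment[slot]) for slot in healthy}
    moved = [t for slot in sorted(slow_slots) for t in assignment.get(slot, [])]
    for tablet in moved:
        # Pick the healthy slot that currently holds the fewest tablets.
        target = min(healthy, key=lambda slot: len(result[slot]))
        result[target].append(tablet)
    return result


initial = dispatch(["t0", "t1", "t2", "t3", "t4", "t5"], num_slots=3)
after = redispatch(initial, slow_slots={2})
```

In this model the tablets of slot 2 are spread over slots 0 and 1, so a single slow slot
no longer determines when the query finishes.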



3.4 SQL-like query
People with an SQL background can easily use Dremel to perform data analytics on
web-scale datasets. Unlike Pig and Hive, Dremel does not convert SQL-like queries into
MapReduce jobs; therefore Dremel should execute queries faster than Pig and Hive.


4 Strong Points
  1. Identification of the characteristics of ad-hoc query datasets.
     The authors correctly identify the main characteristic of the data accessed by
     ad-hoc queries, which is that only a small number of fields are used. This finding
     allows the authors to develop the columnar data model and to use a multi-level
     serving tree to process the columnar data.
  2. Fast and lossless conversion between the nested record structure model and the
     columnar data model.
     Although the columnar data model has been used in other related works, the fast and
     lossless conversion algorithm that the authors propose is novel and one of the key
     contributions of Dremel.
  3. Magnitude and variety of datasets for experiments.
     The magnitude of the datasets is huge and practical. The magnitude and variety of
     the datasets strengthen the evidence for the Dremel solution and demonstrate its
     high performance.


5 Weak Points
  1. A record-oriented data model can still outperform the columnar data model when a
     query reads a large number of fields. This is the main shortcoming of Dremel.
     However, credit must be given to the authors, since they do not hide this
     shortcoming and they provide some insight into it.
  2. Performance analysis of the data model conversion is not discussed. The authors
     claim that the conversion is fast, but they do not support this claim with
     experimental data.
  3. Use of the "cold" setting in the Local Disk experiment. In the Local Disk experiment,
     the authors mention that "all reported times are cold". Using a cold setting in
     database or storage benchmarking is not recommended because the results are heavily
     biased by disk access performance. When the database is "cold", query execution
     performance at the start of the experiment depends heavily on disk speed: most of
     the operations involve moving data from disk to the OS cache, so execution
     performance is dominated by disk access.
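
The cold versus warm distinction in point 3 can be made concrete with a toy Python
model. It is not a real benchmark: the `CachedStore` class and its counters are
hypothetical stand-ins for a disk and an OS cache, and the sketch counts reads instead
of measuring time so that the outcome is deterministic.

```python
class CachedStore:
    """Toy storage: the first read of a block hits 'disk', later reads hit the cache."""

    def __init__(self, blocks):
        self._disk = dict(blocks)
        self._cache = {}
        self.disk_reads = 0
        self.cache_hits = 0

    def read(self, block_id):
        if block_id in self._cache:
            self.cache_hits += 1
            return self._cache[block_id]
        self.disk_reads += 1
        value = self._disk[block_id]
        self._cache[block_id] = value
        return value


def run_query(store, block_ids):
    """A stand-in query: sum all values in the requested blocks."""
    return sum(sum(store.read(block_id)) for block_id in block_ids)


def measure(store, block_ids, runs):
    """Return (disk_reads, cache_hits) for each of `runs` repetitions of the query."""
    stats = []
    for _ in range(runs):
        disk_before, hits_before = store.disk_reads, store.cache_hits
        run_query(store, block_ids)
        stats.append((store.disk_reads - disk_before, store.cache_hits - hits_before))
    return stats


store = CachedStore({block_id: [block_id, block_id + 1] for block_id in range(4)})
per_run = measure(store, [0, 1, 2, 3], runs=3)
```

In this model only the first (cold) run reads from disk, and the later (warm) runs are
served entirely from the cache. A benchmark that reports only cold times therefore
mostly measures the storage device.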




Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 

Último (20)

Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
JAPAN: ORGANISATION OF PMDA, PHARMACEUTICAL LAWS & REGULATIONS, TYPES OF REGI...
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 

Dremel Paper Review

Dremel - Interactive Analysis of Web-Scale Dataset
Paper Review

Arinto Murdopo

November 8, 2012


1 Motivation
The main motivation of the paper is the inability of existing big data infrastructure
(MapReduce-BigTable and Hadoop) to perform fast ad-hoc explorations/queries on
web-scale datasets. In this context, ad-hoc explorations/queries are on-the-fly queries
that users issue whenever they need them. Ad-hoc queries are expected to execute
quickly so that users can explore the datasets interactively.
  A secondary motivation of the paper is the need for a more user-friendly query
mechanism that makes data analytics on big data infrastructure easier and faster. Pig and
Hive try to address this need by providing SQL-like query languages on Hadoop, but they
only solve the usability issue and not the performance issue.
  Solving these issues allows faster, more efficient and more effective data analytics on
big data infrastructure, which implies higher productivity for the data scientists and
engineers who work in big data analytics. Hence, these issues are interesting and
important to solve.


2 Contributions
The main contribution of Dremel is a high-performance technique for processing web-scale
datasets for ad-hoc queries. The mechanism addresses the following challenges:

  1. Inefficiency in storing data for ad-hoc queries.
     Ad-hoc queries usually do not need all the available fields/columns in a table.
     Therefore the authors propose a columnar data model that improves data retrieval
     performance. It was a novel solution at the time because well-known big data
     processing platforms (such as MapReduce on Hadoop) worked on a record-structured
     data model.

  2. Performance and scalability challenges in processing the data.
     The multi-level serving tree model allows high parallelism in processing the data.
     It also allows the query to be optimized at each level of the tree. Query scheduling
     allows executions to be prioritized. Both techniques address the performance
     challenge in data processing. The scalability challenge is also addressed, since
     more nodes can easily be added to the multi-level serving tree to process more data.
     Separating the query scheduler from the root server is a good choice because it
     decouples the query scheduling responsibility from job tracking (note that job
     tracking is performed by the root server). This scheduler model implies higher
     scalability when the number of leaf servers is huge.

The minor contribution of this paper is the use of actual Google datasets in the
experiments. This gives other researchers insight into the practical magnitude of
web-scale datasets.


3 Solutions
3.1 Columnar Data Model
As mentioned before, ad-hoc queries need only a small subset of the fields/columns in a
table. A record-structured data model introduces significant inefficiencies: in order to
retrieve the needed fields/columns, we have to read the whole record, including the
unnecessary fields/columns. The columnar data model is introduced to reduce these
inefficiencies. It allows us to read only the fields/columns that an ad-hoc query needs,
which reduces the amount of data retrieved and increases the speed. The authors also
explain how to convert from the record-structured data model to the columnar model and
vice versa.

3.2 Multi-level Serving Tree
The authors use a multi-level serving tree to process the columnar data. A typical
multi-level serving tree consists of a root server, several intermediate servers and many
leaf servers. There are two reasons behind the multi-level serving tree:

  1. The characteristics of ad-hoc queries, whose result sets are small or medium in
     size. The overhead of processing this kind of data in parallel is small.

  2. The high degree of parallelism available for processing small or medium-size data.

3.3 Query Dispatcher
The authors propose a mechanism that regulates resource allocation for each leaf server.
The mechanism is handled by a module called the query dispatcher. The query dispatcher
works in units of slots, where the number of slots is the number of processing units
available for execution. It allocates an appropriate number of tablets to each slot, and
it deals with stragglers by moving tablets from slow slots to new slots.
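The straggler handling described in Section 3.3 can be sketched in a few lines of Python.
This is only a minimal illustration under stated assumptions, not the dispatcher from the
paper: the function names, the round-robin initial assignment, the cost model (estimated
time = work / speed) and the threshold of twice the median time are all hypothetical.

```python
from statistics import median


def assign_round_robin(tablets, slots):
    """Spread tablets over slots in order; returns {tablet: slot}."""
    return {t: slots[i % len(slots)] for i, t in enumerate(tablets)}


def redispatch_stragglers(assignment, work, speed, factor=2.0):
    """Move tablets whose estimated time is far above the median.

    work[tablet] and speed[slot] form a made-up cost model (assumption):
    estimated time = work / speed. A tablet counts as a straggler when
    its time exceeds factor * median time; it is moved to the fastest slot.
    """
    times = {t: work[t] / speed[s] for t, s in assignment.items()}
    threshold = factor * median(times.values())
    fastest = max(speed, key=speed.get)
    return {
        t: (fastest if times[t] > threshold else s)
        for t, s in assignment.items()
    }


tablets = ["t0", "t1", "t2", "t3"]
slots = ["s0", "s1", "s2"]
work = {t: 10.0 for t in tablets}
speed = {"s0": 10.0, "s1": 1.0, "s2": 10.0}  # s1 is the slow slot

initial = assign_round_robin(tablets, slots)
final = redispatch_stragglers(initial, work, speed)
print(initial)
print(final)
```

In this toy run only the tablet placed on the slow slot `s1` is moved; the other tablets
keep their original slots.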
3.4 SQL-like Query
People with an SQL background can easily use Dremel to perform data analytics on
web-scale datasets. Unlike Pig and Hive, Dremel does not convert SQL-like queries into
MapReduce jobs, therefore Dremel should have faster execution times than Pig and Hive.


4 Strong Points
  1. Identification of the characteristics of ad-hoc query datasets.
     The authors correctly identify the main characteristic of the data accessed by
     ad-hoc queries: only a small number of fields are used. This finding allows the
     authors to develop the columnar data model and to use a multi-level serving tree to
     process it.

  2. Fast and lossless conversion between the nested record-structured model and the
     columnar data model.
     Although the columnar data model has been used in other related work, the fast and
     lossless conversion algorithm that the authors propose is novel and is one of the
     key contributions of Dremel.

  3. Magnitude and variety of the datasets used in the experiments.
     The datasets are huge and come from practical use. Their magnitude and variety
     strengthen the evidence for the Dremel solution and demonstrate its high
     performance.


5 Weak Points
  1. The record-oriented data model can still outperform the columnar data model.
     This is the main shortcoming of Dremel. However, credit must be given to the
     authors, since they do not hide this shortcoming and they provide some insight into
     it.

  2. The performance of the data model conversion is not analyzed.
     The authors claim that the conversion is fast, but they do not support this claim
     with experimental data.

  3. Use of the "cold" setting in the Local Disk experiment.
     In the Local Disk experiment, the authors mention that "all reported times are
     cold". Using a cold setting in database or storage benchmarking is not recommended
     because the results are heavily biased by disk access performance. When the
     database is "cold", query execution performance at the start of the experiment
     depends heavily on the disk speed: most of the operations involve moving data from
     disk to the OS cache, so the execution time is dominated by disk access.
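The trade-off behind weak point 1, and the columnar data model of Section 3.1, can be
illustrated with a toy count of the values each layout has to read. This is a hedged
sketch, not the storage format from the paper: the records are flat and invented for the
example, the function names are hypothetical, and the cost of assembling records from
columns is not modelled at all.

```python
# Invented flat records (assumption); the paper deals with nested records.
records = [
    {"url": "a.com", "lang": "en", "size": 10, "hits": 1},
    {"url": "b.com", "lang": "de", "size": 20, "hits": 2},
    {"url": "c.com", "lang": "en", "size": 30, "hits": 3},
]


def to_columns(rows):
    """Pivot flat records into a columnar layout: {field: [values]}."""
    return {field: [row[field] for row in rows] for field in rows[0]}


def values_read_row_store(rows, wanted):
    """A record-oriented scan reads every field of every record,
    regardless of which fields the query wants."""
    return sum(len(row) for row in rows)


def values_read_column_store(columns, wanted):
    """A columnar scan reads only the requested columns."""
    return sum(len(columns[field]) for field in wanted)


columns = to_columns(records)

one_field = ["size"]
all_fields = list(records[0])

print(values_read_row_store(records, one_field),
      values_read_column_store(columns, one_field))
print(values_read_row_store(records, all_fields),
      values_read_column_store(columns, all_fields))
```

With a single requested field the columnar layout reads a quarter of the values. Once
every field is requested both layouts read the same amount, so the columnar layout no
longer has a read advantage, and any extra work needed to reassemble records (not
modelled here) would count against it.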