1. Dremel
Interactive Analysis
of Web-Scale Datasets
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva
Shivakumar, Matt Tolton, Theo Vassilakis
Presented by Maria Stylianou
marsty5@gmail.com
November 8th, 2012
KTH – Royal Institute of Technology
3. Motivation
Data → Big Data
● Web-scale datasets → more frequent
● Large-scale data analysis → essential!
Existing ad-hoc query tools: NOT FAST
Speed matters!
4. Dremel to the rescue!
● Interactive ad-hoc query system
– Scalable, fault tolerant, fast
● Analysis on in situ nested data
– Accesses data 'in place'
– Non-relational
6. Key Aspects of Dremel
● Storage Format
– Columnar storage representation for nested data
● Query Language & Execution
– SQL & Multi-level serving tree
15. Experiments
Scalability
● Selects the top-20 adverts and their number of occurrences in T4
16. What's happening today?
● Google BigQuery
– Web service [pay-per-query]
● Open Dremel → Apache Drill
– Open-source implementation of Google BigQuery
– Flexibility: broader range of query languages
17. MapReduce or Dremel or both?
                     MR               Dremel
Data Processing      Record-oriented  Column-oriented
In-situ Processing   No               Yes!
Size of Queries      Large            Small/Medium
MapReduce AND Dremel
18. Conclusions
Multi-level execution trees + Columnar data layout
● Scalable & efficient
● MapReduce benefits
● Near-linear scalability
20. References
● S. Melnik et al. Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB, 3(1):330–339, 2010
● G. Czajkowski. Sorting 1PB with MapReduce. http://googleblog.blogspot.se/2008/11/sorting-1pb-with-mapreduce.html
● Apache Drill, http://wiki.apache.org/incubator/DrillProposal
● Google BigQuery, https://developers.google.com/bigquery/
Editor's Notes
- Hello everybody. I will present Dremel, a tool developed at Google. - It has been in use at Google since 2006 - But the paper was published in 2010
Let's briefly see the outline of the presentation. I will start with the authors' motivation to develop Dremel. Then I will explain what Dremel is and which key aspects make it novel. I will continue with the evaluation, showing some of the experiments the authors conducted to support their idea. And of course I will close my presentation with some observations and conclusions.
Their motivation began with the observation that data are becoming BIG. Web-scale datasets are becoming more frequent, and performing data analysis at scale is essential. As you may know, Pig and Hive can run ad-hoc queries over web-scale datasets, BUT they are NOT FAST. This is because they translate queries into MapReduce jobs, which slows down execution. The thing is... speed matters! So what the authors wanted was a tool that executes ad-hoc queries over large-scale datasets rapidly.
Dremel is an interactive ad-hoc query system. It is scalable, fault tolerant, and fast. It performs analysis on nested data in situ. In situ means it accesses data 'in place': the computation is executed where the data are stored. In this case BigTable or the Google File System is used, so the data are not copied into the tool; the tool operates directly on the dataset. The data are nested and non-relational. Dremel (the query processor) also interoperates with other data management tools.
There is a clear comparison between Dremel and MapReduce in the paper. For now, I'll leave this blank and come back when it's time :)
So! Let's start with the main characteristics of Dremel! What makes Dremel so special is the use and combination of: a columnar storage format for the data, and a multi-level serving tree for query execution.
So far, data were stored as records. Let's imagine we have a database with information about each EMDC student. Each record (row) consists of the name, age, nationality, and other data of a student. Until now, all the information for each student was stored together in a record. Google then comes with the novel idea to store data in columns: all names are stored together, all ages together, all nationalities, and so on. So if Sarunas wants to see the ages of his students, he can query the age field and only the age column will be read. That way, retrieval efficiency improves → less data have to be read.
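The contrast between the two layouts can be sketched in a few lines. This is a minimal illustration with made-up student data, not Dremel's actual on-disk format (which also encodes nesting via repetition and definition levels):

```python
# Record-oriented layout: each student's fields are stored together.
records = [
    {"name": "Alice", "age": 24, "nationality": "SE"},
    {"name": "Bob",   "age": 27, "nationality": "GR"},
    {"name": "Carol", "age": 25, "nationality": "PT"},
]

# Column-oriented layout: all values of one field are stored together.
columns = {
    "name":        ["Alice", "Bob", "Carol"],
    "age":         [24, 27, 25],
    "nationality": ["SE", "GR", "PT"],
}

# Querying only the ages from records touches every field of every record;
# with columns, only the single "age" column is read.
ages_from_records = [r["age"] for r in records]
ages_from_columns = columns["age"]
assert ages_from_records == ages_from_columns == [24, 27, 25]
```

The same answer comes out of both layouts; the win is in how much data must be scanned to produce it.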
Dremel uses an SQL-like language. For executing queries, it uses multi-level serving trees. We have many servers, and one of them is the root server. The root server receives the query from the client and:
– determines all tablets of the table related to the query
– rewrites the query and sends it to the next-level servers → How does it rewrite it? In such a way that each intermediate server is assigned some of the tablets
– the intermediate servers do the same: they rewrite the query they received and send it to the next level
– when queries reach the leaf servers, they scan the tablets and execute the queries in parallel, accessing the common storage (Google File System), and send the results back to their parents
– each intermediate server receives several values and aggregates the results into one
– this is done at every level until we reach the root server
Each server has an internal execution tree which includes the evaluation of aggregation functions → for optimization purposes.
Dremel is a multi-user system → several queries are executed at the same time. Fault tolerance and straggler detection also affect execution time positively. Tablets are replicated 3-way: when a leaf server cannot access a tablet replica, it fails over to another replica. A parameter specifies the minimum percentage of tablets that must be scanned before returning a result → setting this parameter low can speed up execution significantly. Dremel allows for "99.9%"-type results, which reflect almost all, but not quite all, of the data.
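That minimum-percentage knob can be illustrated with a small sketch. The numbers and function name here are hypothetical; the point is only the trade: stop scanning once enough tablets are covered, returning a slightly incomplete but much faster answer:

```python
def scan_with_threshold(tablets, min_fraction):
    """Return a partial sum once min_fraction of the tablets are scanned."""
    total, scanned = 0, 0
    for tablet in tablets:
        total += sum(tablet)          # "scan" one tablet
        scanned += 1
        if scanned / len(tablets) >= min_fraction:
            break                     # enough coverage for a 99.x% answer
    return total, scanned

tablets = [[1] * 10 for _ in range(1000)]   # 1000 tablets, 10 rows each
total, scanned = scan_with_threshold(tablets, min_fraction=0.999)
assert scanned == 999                        # stopped one tablet short of all
```

Lowering `min_fraction` further would return even sooner, at the cost of missing more tablets (for example, stragglers or unreachable replicas).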
Now let's move on to the experiments they conducted. I only present the most important ones – according to me :) The authors used 5 different tables in 2 different datasets, each with a different number of records, starting from 4 billion up to more than 1 trillion. The compressed data vary from 13 TB to 105 TB, while the number of fields ranges from 30 to 1200.
In the first experiment they measured scalability: the query selects the top-20 adverts and their number of occurrences in table T4.
A team of Israeli engineers is building a clone they call OpenDremel, though one of these developers, David Gruzman, says that coding is only just beginning again after a long hiatus. Google now offers a Dremel web service it calls BigQuery. You can use the platform via an online API (application programming interface): basically, you upload your data to Google, and it lets you run queries on Google's internal infrastructure.
There is a clear comparison between Dremel and MapReduce in the paper. The authors' intention is not to replace MapReduce, but to complement it.