Apache Drill (ver. 0.2)

Apache Drill
Design proposal from
OpenDremel team
HLD Version 0.2, 9/sep/2012

Camuel Gilyadov & Constantine Peresypkin,
Email: Camuel@BigDataCraft.com

Intro

• This is a high-level design proposal for the Apache Drill
project from the OpenDremel team.
• History slides and the usual “about us” material have been
moved to the end of the deck.
• A slide with all relevant links is also at the end.

Design Tenet #1

• Apache Drill must support multi-tenant semantics
internally rather than being run as a whole inside guest VMs.
• It should be inspired by BigQuery and not only by the
Dremel/PowerDrill/Tenzing papers.
• It is not practical to set up a dedicated cloud (billed
hourly) just to run a query for a few seconds.
• The codebase must be clearly divided into a trusted part
and an untrusted part. The trusted part must be kept to an
absolute minimum and must be peer-reviewed, secured,
audited and metered.

Design Tenet #2

• Apache Drill must be modular and customizable in
many dimensions.
• The schema-on-read concept must be supported.
An imperatively coded, high-performance data parser must
be embeddable into the query.
• SQL is no longer enough. It must be easy to add new query
languages as well as user-defined functions (UDFs)
implementing deep analytics (such as statistics and
machine learning).
• Additionally, various data formats must be supported,
such as column stores, row stores, PAX and RCFiles.

Design Tenet #2 (cont.)

• We suggest that the query plan format be relaxed to an
arbitrary executable, and the data format relaxed to an
arbitrary opaque BLOB.
• This way new query languages and new data formats
can be supported easily without changing the backend.
• As an added benefit, the backend becomes a generic,
lightweight, homogeneous compute-storage cloud.
• Such an approach exhibits good separation of control.
The cloud operator controls and bills for generic
infrastructure, and the query engine is left completely in
the control of the tenant/user.

Design Tenet #3

• Apache Drill requests/queries must be hyper-elastic,
meaning able to exploit the compute capacity of
thousands of servers for a short duration of just a few
seconds. No resources may be kept spinning per user
between queries or when idle.
• Traditional VMs are too heavyweight for that.
Container approaches such as OpenVZ/LXC are
not secure enough in a multi-tenancy context.
• We suggest making sandboxing pluggable and
supporting ZeroVM (developed for OpenDremel) and
LXC (which is fine for private clouds) to begin with.
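
One way to picture the pluggable-sandbox suggestion is a small registry that picks a sandbox by deployment type. This is only a sketch: the `Sandbox` and `SandboxRegistry` names and the `multiTenantSafe` flag are hypothetical and do not come from ZeroVM, LXC, or any published Drill code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical plug-in point for sandbox implementations.
interface Sandbox {
    String name();
    boolean multiTenantSafe();
}

final class SandboxRegistry {
    // Insertion order doubles as preference order.
    private final Map<String, Sandbox> byName = new LinkedHashMap<>();

    void register(Sandbox sandbox) {
        byName.put(sandbox.name(), sandbox);
    }

    // Returns the first registered sandbox acceptable for the deployment:
    // any sandbox for a private cloud, only multi-tenant-safe ones otherwise.
    Sandbox select(boolean multiTenantDeployment) {
        for (Sandbox sandbox : byName.values()) {
            if (!multiTenantDeployment || sandbox.multiTenantSafe()) {
                return sandbox;
            }
        }
        throw new IllegalStateException("no suitable sandbox registered");
    }

    // Convenience factory for a descriptor-only sandbox (illustration only).
    static Sandbox of(String name, boolean multiTenantSafe) {
        return new Sandbox() {
            public String name() { return name; }
            public boolean multiTenantSafe() { return multiTenantSafe; }
        };
    }
}
```

With LXC registered as not multi-tenant safe and ZeroVM as safe, a private-cloud deployment would get LXC and a public multi-tenant one would get ZeroVM, which mirrors the split suggested on this slide.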

Design Tenet #4

• Apache Drill must be efficient.
• Value-per-bit is extremely low with BigData.
• Overhead in the inner loop must be kept to a minimum.
• We found Java inefficient for general number
crunching (such as data compression). The main
problem with Java is that GC overhead is unavoidable
for the whole data corpus being scanned. We went so
far as to keep all data in byte arrays and auto-generate
the transformation code; it still underperformed, and
code complexity went through the roof.

Suggested Architecture

[Diagram: Browser / Client → Single-Tenant Frontend → Multi-Tenant Backend]
• Single-Tenant Frontend: runs inside a traditional guest VM (JVM).
– Query → Query Compiler → Executable job
• Multi-Tenant Backend: scale-out object store and in-situ compute.
– Runs the Executable job

Suggested Frontend Design
• Usual Java single-tenant web application.
• In charge of:
– All interaction with user.
– Query/job submission
– Query/job progress monitoring
– Result browsing

[Diagram: Client Tools (CLI, AJAX App) → Java Servlet (REST Gateway → Query Compiler)]

Suggested AJAX

• Which AJAX framework?
• ExtJS?
• Look & feel: just clone Google App with the
trademarks and logos replaced?
• Why is a web UI more important for Drill than for
Hive?
– Drill is interactive, so at least a basic web UI must be
provided with each release.

Suggested CLI Design
• Would Bash + curl suffice?
• Or a full-blown Java CLI tool?

Suggested REST-GW Design
• Usual vanilla Java WebApp with Spring!

Suggested Query Compiler Design #1
• The Query Compiler consists of two component
libraries, parsers and planners, with a stable but
language-dependent interface between them (so no
reuse, unfortunately):

[Diagram: Query Text → Parsers → SemanticModelReader → Planners → Executable Script]
– Parsers report syntax errors.
– Planners report semantic errors.

Suggested Query Compiler Design #2
• DrqlSemanticModelReader is ready and published
under …..
• The SemanticModel that a parser produces closely follows
the original language. A parser just parses the query text and
does not attempt to “give it meaning” or annotate it.
• Simplified example:
– List<Expression> getResultColumns();
– List<DrqlQuery> getFromClause();
– List<ColumnId> getGroupByClause();
– etc.

Suggested Query Compiler Design #3
• What is an Executable Script?
– A self-contained, serializable, executable object that, when executed by
an appropriate executor, yields the correct query result on given input data
of the expected format.
– Self-contained means no dependencies: everything is included in that
executable object.
– In particular, data parsing logic is included.
– However, data access logic is NOT included.
– The model for the script is: “here is your blob of size N mapped to
memory starting from address S, you have time T to generate your
result up to size R in memory starting from address D. You will be
terminated without advance notice for any attempted violation of
any restriction.”
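
The contract above can be mimicked in plain Java to show its shape. This is a sketch under assumptions: `ExecutableScript`, `SandboxedRunner` and their parameters are hypothetical names, the blob and result regions are modeled as `ByteBuffer`s instead of raw addresses S and D, and an in-process thread only approximates what a real sandbox would enforce.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical contract: the script sees a read-only blob and a bounded result buffer.
interface ExecutableScript {
    void run(ByteBuffer input, ByteBuffer output) throws Exception;
}

final class SandboxedRunner {
    // Runs the script with a time budget (T) and a result size cap (R).
    // Returns the bytes the script wrote, or throws if a limit is violated.
    static byte[] execute(ExecutableScript script, byte[] blob,
                          int maxResultBytes, long timeoutMillis) {
        ByteBuffer input = ByteBuffer.wrap(blob).asReadOnlyBuffer();
        ByteBuffer output = ByteBuffer.allocate(maxResultBytes);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<Object> job = pool.submit(() -> {
                script.run(input, output);
                return null;
            });
            try {
                job.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                job.cancel(true); // interrupts the worker; cooperative, not a hard kill
                throw new IllegalStateException("terminated: time limit exceeded", e);
            } catch (ExecutionException e) {
                // Covers writes past the result cap or into the read-only input.
                throw new IllegalStateException("terminated: " + e.getCause(), e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
            output.flip();
            byte[] result = new byte[output.remaining()];
            output.get(result);
            return result;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

For example, `SandboxedRunner.execute((in, out) -> out.putInt(in.remaining()), blob, 4, 1000)` returns four bytes holding the blob size, while a script that tries to write more than four bytes is rejected. Thread interruption cannot stop a busy loop, which is one reason the deck argues for a real sandbox rather than in-process checks.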

Suggested Query Compiler Design #3 (cont.)
• How is the executable script generated?
1. The parser provides the planner with a query object that
implements the SemanticModelReader interface.
2. The planner examines the semantic model through the
SemanticModelReader interface and produces a query plan
object that implements the QueryPlanModelReader interface.
Query analysis and optimization take place during this stage and, if
needed, additional QueryPlanModelRewriter
and/or QueryPlanModelVisitor interfaces could be created for this
purpose. However, DrQL is a simple language without a large (or any)
search space, so the value of an optimizer is small. We suggest bypassing
query rewriting and query optimization altogether for initial releases.
3. Once the query plan is generated, the most appropriate code template script
is selected. The template engine then processes the template together with
the QueryPlanModelReader object to produce the executable script.
Suggested Backend Design

• TODO
• Executors per se
– Janino based Java Executor
– LXC-GCC based C Executor
– ZeroVM-GCC based C Executor
• Storage platforms with collocated data processing
– Local files (non distributed)
– HDFS
– OpenStack Swift

OpenDremel/Dazo

[Diagram: current state of each component of the suggested architecture]
• Frontend: two separate unfinished jQuery apps and a command-line
app with no particular codenames.
• Query compiler (JVM): we call it Metaxa (historic reasons). BQL parser,
unfinished compiler based on Apache Velocity.
– Query → Query Compiler → Executable job
• Backend: we call it Zwift (Swift + ZeroVM). Alpha quality.

What is Swift?

“Swift is a highly available, distributed,
eventually consistent object/blob store.
Organizations can use Swift to store
lots of data efficiently, safely, and
cheaply.”

Don’t get it?

Swift is THE open-source
implementation of
Amazon S3

What is ZeroVM?

Highly secure, low-overhead, low-latency container-style
virtualization based on the Google Native Client project. The
critical security code is taken verbatim from the Chrome
Browser project and is therefore as secure as the Chrome
Browser. More info: http://ZeroVM.org and
http://news.ycombinator.com/item?id=3746222

ZeroVM highlights

1. Disposable VM per request
2. HyperElasticity per request
3. Embeddable into everything
4. High-performance (x86/ARM)
5. Erlang inspired clustering
6. Written in pure C, no deps

Don’t get it?

ZeroVM to Virtualization
is what
SQLite is to Databases

Links

• https://github.com/ApacheDrill/Brainstorm/wiki/Apache-Drill-Links
• OpenDremel (1st generation design):
– http://code.google.com/p/dremel/source/browse?repo=dremel
– http://code.google.com/p/dremel/source/browse?repo=metaxa

• Dazo (2nd generation design):
– https://github.com/Dazo-org

OpenDremel Story: 2010

• Camuel Gilyadov started a Dremel implementation, named
OpenDremel, in summer 2010.
• David Gruzman joined the effort a few months later,
followed by Constantine Peresypkin.
• There was no comprehensive design or architecture.
The goal was to get the hierarchical-columnar transformation
working smoothly and in strict accordance with the
Dremel paper. We published several working
implementations under the Apache License.
• Hong San was hired as the first full-timer to speed up
development. The Metaxa milestone was set.
• The early OpenDremel design was found too naive, mainly due to
Java underperformance in inner number-crunching loops.
• After fierce brainstorming, the project was restarted from scratch
under the new name Dazo. With Dazo, the query plan is an arbitrary
piece of executable native code, with a Java frontend.
• From then on we took inspiration from BigQuery rather than
from the Dremel paper.
• We decided to use Google NaCl as the sandboxing technology to
isolate queries as well as to meter resource consumption. The new
sandbox was named ZeroVM.
• For storage we decided to use OpenStack Swift.


• With four people full-time and several others part-time, we still
do not have a fully integrated version, but we are satisfied
with what we have achieved and convinced that the
decisions behind Dazo were correct.
• We believe ZeroVM could be a disruptive technology in
itself, revolutionizing the BigData@Cloud space.
• We are excited by the Apache Drill initiative and hope to be
useful to it.
• Check the blog: http://BigDataCraft.com

Thanks
Camuel Gilyadov,
Email: Camuel@BigDataCraft.com
