Hadoop User Group - Status Apache Drill

Apache Drill status

Michael Hausenblas, Chief Data Engineer EMEA, MapR

HUG Munich, 2013-04-19

Kudos to http://cmx.io/

Workloads

•  Batch processing (MapReduce)

•  Light-weight OLTP (HBase, Cassandra, etc.)

•  Stream processing (Storm, S4)

•  Search (Solr, Elasticsearch)

•  Interactive, ad-hoc query and analysis (?)

Impala
Interactive Query at Scale

low-latency

Use Case I

•  Jane, a marketing analyst

•  Determine target segments

•  Data from different sources

Use Case II

•  Logistics – supplier status

•  Queries

–  How many shipments from supplier X?

–  How many shipments in region Y?

SUPPLIER_ID  NAME                 REGION

ACM          ACME Corp            US
GAL          GotALot Inc          US
BAP          Bits and Pieces Ltd  Europe
ZUP          Zu Pli               Asia

{
  "shipment": 100123,
  "supplier": "ACM",
  "timestamp": "2013-02-01",
  "description": "first delivery today"
},
{
  "shipment": 100124,
  "supplier": "BAP",
  "timestamp": "2013-02-02",
  "description": "hope you enjoy it"
}
…
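The two queries from Use Case II can be sketched over the sample data. This is a minimal Python illustration, assuming the supplier table and the shipment records are already loaded into memory; the function names are hypothetical and not part of Drill.

```python
from collections import Counter

# Supplier reference data, as in the supplier table (e.g. held in an RDBMS).
SUPPLIERS = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}

# Shipment records, as in the JSON sample (e.g. held in a document store).
SHIPMENTS = [
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
]


def shipments_per_supplier(shipments):
    """Hypothetical helper: count shipments grouped by supplier ID."""
    return Counter(s["supplier"] for s in shipments)


def shipments_per_region(shipments, suppliers):
    """Hypothetical helper: join shipments to suppliers, then count by region."""
    return Counter(suppliers[s["supplier"]]["region"] for s in shipments)


by_supplier = shipments_per_supplier(SHIPMENTS)
by_region = shipments_per_region(SHIPMENTS, SUPPLIERS)
```

The second query needs a join across two differently shaped sources, which is the situation the use case describes.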

Today’s Solutions

•  RDBMS-focused

–  ETL data from MongoDB and Hadoop

–  Query data using SQL

•  MapReduce-focused

–  ETL from RDBMS and MongoDB

–  Use Hive, etc.

Requirements

•  Support for different data sources

•  Support for different query interfaces

•  Low-latency/real-time

•  Ad-hoc queries

•  Scalable, reliable

Google’s Dremel*

*) http://research.google.com/pubs/pub36632.html

Apache Drill Overview

•  Inspired by Google’s Dremel

•  Standard SQL 2003 support

•  Other QL possible

•  Pluggable data sources

•  Support for nested data

•  Schema is optional

•  Community driven, open, 100’s involved

High-level Architecture

•  Each node runs a Drillbit, to maximize data locality

•  Coordination, query planning, execution, etc. are distributed

•  By default Drillbits hold all roles

•  Any node can act as endpoint for a query

[Diagram: four nodes, each running a Drillbit next to its storage process]

High-level Architecture

•  ZooKeeper for ephemeral cluster membership info

•  Distributed cache (Hazelcast) for metadata, locality information, etc.

[Diagram: Curator/ZooKeeper above four nodes, each running a Drillbit with a distributed cache next to its storage process]

High-level Architecture

•  Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc.

•  Streaming data communication avoiding SerDe

[Diagram: Curator/ZooKeeper above four nodes, each running a Drillbit with a distributed cache next to its storage process]

Principled Query Execution

Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution

Source query languages: SQL 2003, DrQL, MongoQL, DSL

Diagram labels: parser API, topology, scanner API

Logical plan (excerpt):

query: [
  {
    @id: "log",
    op: "sequence",
    do: [
      {
        op: "scan",
        source: "logs"
      },
      {
        op: "filter",
        condition: "x > 3"
      },
      …
Drillbit Modules

[Diagram: RPC Endpoint; Parser (SQL, HiveQL, Pig); Logical Plan; Optimizer; Physical Plan; Foreman; Scheduler; Operators; Distributed Cache; Storage Engine Interface (DFS Engine, HBase Engine, Mongo)]

Key Features

•  Full SQL 2003

•  Nested data

•  Optional schema

•  Extensibility points

Full SQL – ANSI SQL 2003

•  SQL-like is often not enough

•  Integration with existing tools

–  Datameer, Tableau, Excel, SAP Crystal Reports

–  Use standard ODBC/JDBC driver

Nested Data

•  Nested data becoming prevalent

–  JSON/BSON, XML, ProtoBuf, Avro

–  Some data sources support it natively (MongoDB, etc.)

•  Flattening nested data is error-prone

•  Extension to ANSI SQL 2003
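One way to see why flattening is error-prone: a minimal Python sketch with a hypothetical nested record (the `order` data and the `flatten` helper are illustrative assumptions, not Drill code). Flattening a repeated field copies the parent values into every child row, so a naive aggregate over the flattened rows counts parent values once per child.

```python
# Hypothetical nested record: one parent with a repeated "items" field.
order = {
    "id": "0001",
    "quantity": 10,
    "items": [{"sku": "a"}, {"sku": "b"}],
}


def flatten(record, repeated_field):
    """Emit one flat row per child, copying the parent fields into each row."""
    parent = {k: v for k, v in record.items() if k != repeated_field}
    return [
        {**parent, **{f"{repeated_field}.{k}": v for k, v in child.items()}}
        for child in record[repeated_field]
    ]


rows = flatten(order, "items")

# The parent quantity now appears in every child row, so summing it
# over the flattened rows counts it twice.
naive_total = sum(r["quantity"] for r in rows)
true_total = order["quantity"]
```

A record whose repeated field is empty would disappear from the flattened output altogether, which is a second way the flat form can silently change results.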

Optional Schema

•  Many data sources don’t have rigid schemas

–  Schema changes rapidly

–  Different schema per record (e.g. HBase)

•  Supports queries against unknown schema

•  Schema can be defined by the user or obtained via discovery
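As a rough illustration of schema discovery over records that do not share a fixed schema, here is a minimal Python sketch. The `discover_schema` helper and the extra `priority` field are assumptions made for the example; this is not how Drill implements discovery.

```python
def discover_schema(records):
    """Collect, per field name, the set of value type names seen in the records."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


# Two records with different fields, as can happen in a schema-less store.
records = [
    {"shipment": 100123, "supplier": "ACM"},
    {"shipment": 100124, "supplier": "BAP", "priority": True},
]

schema = discover_schema(records)
```

Fields that occur in only some records, such as `priority` here, still end up in the discovered schema, so a query can reference them without an upfront definition.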

Extensibility Points

•  Source query → parser API

•  Custom operators, UDF → logical plan

•  Serving tree, CF, topology → physical plan/optimizer

•  Data sources & formats → scanner API

Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution

… and Hadoop?

•  HDFS can be a data source

•  Complementary use cases*

•  … use Apache Drill

–  Find record with specified condition

–  Aggregation under dynamic conditions

•  … use MapReduce

–  Data mining with multiple iterations

–  ETL

*) https://cloud.google.com/files/BigQueryTechnicalWP.pdf

Example

https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo

data source: donuts.json

{
  "id": "0001",
  "type": "donut",
  "ppu": 0.55,
  "batters":
    {
      "batter":
        [
          { "id": "1001", "type": "Regular" },
          { "id": "1002", "type": "Chocolate" },
          …

logical plan: simple_plan.json

query:[ {
  op:"sequence",
  do:[
    {
      op: "scan",
      ref: "donuts",
      source: "local-logs",
      selection: {data: "activity"}
    },
    {
      op: "filter",
      expr: "donuts.ppu < 2.00"
    },
    …

result: out.json

{
"sales" : 700.0,
"typeCount" : 1,
"quantity" : 700,
"ppu" : 1.0
}
{
"sales" : 109.71,
"typeCount" : 2,
"quantity" : 159,
"ppu" : 0.69
}
{
"sales" : 184.25,
"typeCount" : 2,
"quantity" : 335,
"ppu" : 0.55
}
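The scan and filter steps of the example plan can be sketched as a toy interpreter. This is a simplified Python illustration under stated assumptions: the plan structure (`field`, `cmp`, `value` keys), the `run_plan` function, and the second donut record are all hypothetical, and the sketch is neither Drill's logical plan format nor its reference interpreter.

```python
import operator

# Comparison operators the toy filter step understands.
OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def run_plan(plan, sources):
    """Run the steps in order: 'scan' loads rows, 'filter' keeps matching rows."""
    rows = []
    for step in plan:
        if step["op"] == "scan":
            rows = list(sources[step["source"]])
        elif step["op"] == "filter":
            compare = OPS[step["cmp"]]
            rows = [r for r in rows if compare(r[step["field"]], step["value"])]
        else:
            raise ValueError(f"unsupported op: {step['op']}")
    return rows


# Hypothetical sample data; only the first record mirrors donuts.json.
sources = {
    "donuts": [
        {"id": "0001", "ppu": 0.55},
        {"id": "0002", "ppu": 2.50},
    ]
}

# Simplified stand-in for the scan + filter (ppu < 2.00) steps of the plan.
plan = [
    {"op": "scan", "source": "donuts"},
    {"op": "filter", "field": "ppu", "cmp": "<", "value": 2.00},
]

result = run_plan(plan, sources)
```

Each step consumes the rows produced by the previous one, which mirrors the "sequence" operator wrapping the scan and filter steps in the plan excerpt.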

Status

•  Heavy development by multiple organizations

•  Available

–  Logical plan (ADSP)

–  Reference interpreter

–  Basic SQL parser

–  Basic demo

–  Basic HBase back-end

Status

April 2013

•  Extend SQL syntax

•  Physical plan

•  In-memory compressed data interfaces

•  Distributed execution

Contributing

•  Learn where and how to contribute https://cwiki.apache.org/confluence/display/DRILL/Contributing

•  Jira, Git, Apache build and test tools

•  Preparing for dependencies

–  Hazelcast

–  Netflix Curator

Contributing

General contributions appreciated:

•  Supersonic (?)

•  Test data & test queries

•  Use case scenarios (textual desc./SQL queries)

•  Documentation

Contributing

•  Dremel-inspired columnar format

–  Twitter’s Parquet

–  Hive’s ORC file

•  Integration with Hive metastore (?)

•  DRILL-13 Storage Engine: Define Java Interface

•  DRILL-15 Build HBase storage engine implementation

Contributing

•  DRILL-48 RPC interface for query submission and physical plan execution

•  DRILL-53 Setup cluster configuration and membership mgmt system

•  Further schedule

–  Alpha Q2

–  Beta Q3

Kudos to …

•  Julian Hyde, Pentaho

•  Lisen Mu

•  Tim Chen, Microsoft

•  Chris Merrick, RJMetrics

•  David Alves, UT Austin

•  Sree Vaadi, SSS/NGData

•  Jacques Nadeau, MapR

•  Ted Dunning, MapR

Engage!

•  Follow @ApacheDrill on Twitter

•  Sign up at mailing lists (user | dev) http://incubator.apache.org/drill/mailing-lists.html

•  Standing G+ hangouts every Tuesday at 18:00 CET

•  Keep an eye on http://drill-user.org/
