5. Use Case I
• Jane, a marketing analyst
• Determine target segments
• Data from different sources
6. Use Case II
• Logistics – supplier status
• Queries
– How many shipments from supplier X?
– How many shipments in region Y?
SUPPLIER_ID  NAME                 REGION
ACM          ACME Corp            US
GAL          GotALot Inc          US
BAP          Bits and Pieces Ltd  Europe
ZUP          Zu Pli               Asia
{
"shipment": 100123,
"supplier": "ACM",
"timestamp": "2013-02-01",
"description": "first delivery today"
},
{
"shipment": 100124,
"supplier": "BAP",
"timestamp": "2013-02-02",
"description": "hope you enjoy it"
}
…
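Answering the two queries above means combining the supplier table with the shipment documents. A minimal Python sketch, using only the rows and records shown on this slide; the function names `shipments_from_supplier` and `shipments_in_region` are illustrative and not part of any Drill API:

```python
import json

# Supplier table from the slide: SUPPLIER_ID -> (NAME, REGION)
SUPPLIERS = {
    "ACM": ("ACME Corp", "US"),
    "GAL": ("GotALot Inc", "US"),
    "BAP": ("Bits and Pieces Ltd", "Europe"),
    "ZUP": ("Zu Pli", "Asia"),
}

# The two shipment documents shown on the slide, wrapped in a JSON array.
SHIPMENTS = json.loads("""
[
  {"shipment": 100123, "supplier": "ACM",
   "timestamp": "2013-02-01", "description": "first delivery today"},
  {"shipment": 100124, "supplier": "BAP",
   "timestamp": "2013-02-02", "description": "hope you enjoy it"}
]
""")


def shipments_from_supplier(supplier_id):
    """Count shipment documents whose supplier field matches supplier_id."""
    return sum(1 for s in SHIPMENTS if s["supplier"] == supplier_id)


def shipments_in_region(region):
    """Count shipments whose supplier is located in the given region."""
    return sum(
        1 for s in SHIPMENTS
        if SUPPLIERS.get(s["supplier"], (None, None))[1] == region
    )


print(shipments_from_supplier("ACM"))
print(shipments_in_region("Europe"))
```

The second query is the interesting one: it needs a join between a relational table and a set of JSON documents, which normally live in different systems.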
7. Today's Solutions
• RDBMS-focused
– ETL data from MongoDB and Hadoop
– Query data using SQL
• MapReduce-focused
– ETL from RDBMS and MongoDB
– Use Hive, etc.
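The RDBMS-focused approach can be sketched with Python's built-in `sqlite3` module standing in for the relational database: documents are first copied into tables (the ETL step) and only then queried with SQL. The table and column names are assumptions chosen to match Use Case II, and the in-memory database is only a stand-in for a real RDBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE supplier (supplier_id TEXT PRIMARY KEY, name TEXT, region TEXT)"
)
conn.execute(
    "CREATE TABLE shipment (shipment INTEGER, supplier TEXT, ts TEXT, description TEXT)"
)

conn.executemany(
    "INSERT INTO supplier VALUES (?, ?, ?)",
    [
        ("ACM", "ACME Corp", "US"),
        ("GAL", "GotALot Inc", "US"),
        ("BAP", "Bits and Pieces Ltd", "Europe"),
        ("ZUP", "Zu Pli", "Asia"),
    ],
)

# ETL step: documents that originally lived in a document store
# are copied, field by field, into relational rows.
documents = [
    {"shipment": 100123, "supplier": "ACM",
     "timestamp": "2013-02-01", "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP",
     "timestamp": "2013-02-02", "description": "hope you enjoy it"},
]
conn.executemany(
    "INSERT INTO shipment VALUES (:shipment, :supplier, :timestamp, :description)",
    documents,
)


def count_shipments_in_region(region):
    """Run the 'shipments in region Y' query as a SQL join."""
    row = conn.execute(
        "SELECT COUNT(*) FROM shipment sh "
        "JOIN supplier su ON sh.supplier = su.supplier_id "
        "WHERE su.region = ?",
        (region,),
    ).fetchone()
    return row[0]


print(count_shipments_in_region("US"))
```

The query itself is short; the cost of this approach sits in the copy step, which has to be repeated whenever the source documents change.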
8. Requirements
• Support for different data sources
• Support for different query interfaces
• Low-latency/real-time
• Ad-hoc queries
• Scalable, reliable
10. Apache Drill Overview
• Inspired by Google's Dremel
• Standard SQL 2003 support
• Other query languages possible
• Pluggable data sources
• Support for nested data
• Schema is optional
• Community driven, open, hundreds involved
12. High-level Architecture
• Each node: Drillbit - maximize data locality
• Coordination, query planning, execution, etc., are distributed
• By default Drillbits hold all roles
• Any node can act as endpoint for a query
[Diagram: four nodes, each running a Drillbit next to a storage process]
13. High-level Architecture
• Zookeeper for ephemeral cluster membership info
• Distributed cache (Hazelcast) for metadata, locality information, etc.
[Diagram: Curator/Zk above four nodes; each node runs a Drillbit with a distributed cache next to a storage process]
14. High-level Architecture
• Originating Drillbit acts as foreman, manages query execution, scheduling, locality information, etc.
• Streaming data communication avoiding SerDe
[Diagram: Curator/Zk above four nodes; each node runs a Drillbit with a distributed cache next to a storage process]
17. Key Features
• Full SQL 2003
• Nested data
• Optional schema
• Extensibility points
18. Full SQL – ANSI SQL 2003
• SQL-like is often not enough
• Integration with existing tools
– Datameer, Tableau, Excel, SAP Crystal Reports
– Use standard ODBC/JDBC driver
19. Nested Data
• Nested data becoming prevalent
– JSON/BSON, XML, ProtoBuf, Avro
– Some data sources support it natively (MongoDB, etc.)
• Flattening nested data is error-prone
• Extension to ANSI SQL 2003
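One way flattening can go wrong is sketched below in Python: when a record with a repeated nested field is turned into flat rows, the parent's values are copied onto every row, and a naive aggregate over the flat rows double-counts them. The `order` record and the `flatten` helper are hypothetical and exist only for this illustration.

```python
# Hypothetical nested record: one order with two line items.
order = {
    "order_id": 1,
    "total": 30.0,
    "items": [
        {"sku": "a", "price": 10.0},
        {"sku": "b", "price": 20.0},
    ],
}


def flatten(record, repeated_field):
    """Emit one flat row per child, copying the parent fields onto each row."""
    parent = {k: v for k, v in record.items() if k != repeated_field}
    return [
        {**parent, **{f"{repeated_field}.{k}": v for k, v in child.items()}}
        for child in record[repeated_field]
    ]


rows = flatten(order, "items")

# Naive aggregate over flat rows: the order total is counted once per item.
naive_total = sum(r["total"] for r in rows)

# Correct aggregate: deduplicate on the parent key before summing.
correct_total = sum({r["order_id"]: r["total"] for r in rows}.values())

print(len(rows), naive_total, correct_total)
```

Querying the nested structure directly avoids having to remember which columns were duplicated by the flattening step.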
20. Optional Schema
• Many data sources don't have rigid schemas
– Schema changes rapidly
– Different schema per record (e.g. HBase)
• Supports queries against unknown schema
• Schema can be defined by the user or obtained via discovery
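Schema discovery over records that do not share a fixed layout can be illustrated with a short Python sketch. It scans the records and collects, for every field seen, the set of value types encountered. This is only an illustration of the idea, not a description of how Drill implements discovery; the `discover_schema` function and the sample records are assumptions made for the example.

```python
from collections import defaultdict


def discover_schema(records):
    """Map each field name to the sorted list of type names observed for it."""
    schema = defaultdict(set)
    for record in records:
        for field, value in record.items():
            schema[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}


# Hypothetical records with a different shape per record.
records = [
    {"shipment": 100123, "supplier": "ACM"},
    {"shipment": 100124, "supplier": "BAP", "priority": True},
    {"shipment": "100125", "region": "Asia"},
]

schema = discover_schema(records)
for field, types in schema.items():
    print(field, types)
```

A field that appears with more than one type, such as `shipment` here, is exactly the case a rigid, predeclared schema cannot represent.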
21. Extensibility Points
• Source query → parser API
• Custom operators, UDF → logical plan
• Serving tree, CF, topology → physical plan/optimizer
• Data sources & formats → scanner API
[Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution]
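The stage boundaries named on this slide can be made concrete with a toy pipeline in Python. Everything here is an assumption made for illustration: the tiny query syntax (`count where field=value`), the tuple-based plan format, and the function names `parse`, `optimize`, and `execute` are not Drill's. The sketch only shows why separate stages give separate places to plug in.

```python
def parse(query):
    """Parser: turn 'count where <field>=<value>' into a logical plan."""
    head, _, predicate = query.partition(" where ")
    if head != "count" or "=" not in predicate:
        raise ValueError(f"unsupported query: {query!r}")
    field, value = predicate.split("=", 1)
    return [("scan",), ("filter", field.strip(), value.strip()), ("count",)]


def optimize(logical_plan):
    """Optimizer: fuse a scan followed by a filter into one physical operator."""
    physical = []
    for op in logical_plan:
        if op[0] == "filter" and physical and physical[-1][0] == "scan":
            physical[-1] = ("filtered_scan", op[1], op[2])
        else:
            physical.append(op)
    return physical


def execute(physical_plan, rows):
    """Execution: run each physical operator over the rows in order."""
    result = list(rows)
    for op in physical_plan:
        if op[0] in ("filter", "filtered_scan"):
            result = [r for r in result if str(r.get(op[1])) == op[2]]
        elif op[0] == "count":
            result = len(result)
        # A bare "scan" passes the rows through unchanged.
    return result


rows = [
    {"shipment": 100123, "supplier": "ACM"},
    {"shipment": 100124, "supplier": "BAP"},
]
logical = parse("count where supplier=ACM")
physical = optimize(logical)
print(logical)
print(physical)
print(execute(physical, rows))
```

In this toy, a different query language would replace only `parse`, a new operator would add a plan node, and a new data source would change only where `rows` come from.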
22. … and Hadoop?
• HDFS can be a data source
• Complementary use cases*
• … use Apache Drill
– Find record with specified condition
– Aggregation under dynamic conditions
• … use MapReduce
– Data mining with multiple iterations
– ETL
*) https://cloud.google.com/files/BigQueryTechnicalWP.pdf
24. Status
• Heavy development by multiple organizations
• Available
– Logical plan (ADSP)
– Reference interpreter
– Basic SQL parser
– Basic demo
– Basic HBase back-end
25. Status April 2013
• Extend SQL syntax
• Physical plan
• In-memory compressed data interfaces
• Distributed execution
26. Contributing
• Learn where and how to contribute
https://cwiki.apache.org/confluence/display/DRILL/Contributing
• Jira, Git, Apache build and test tools
• Preparing for dependencies
– Hazelcast
– Netflix Curator
27. Contributing
General contributions appreciated:
• Supersonic (?)
• Test data & test queries
• Use case scenarios (textual desc./SQL queries)
• Documentation
29. Contributing
• DRILL-48 RPC interface for query submission and physical plan execution
• DRILL-53 Setup cluster configuration and membership mgmt system
• Further schedule
– Alpha Q2
– Beta Q3
30. Kudos to …
• Julian Hyde, Pentaho
• Lisen Mu
• Tim Chen, Microsoft
• Chris Merrick, RJMetrics
• David Alves, UT Austin
• Sree Vaadi, SSS/NGData
• Jacques Nadeau, MapR
• Ted Dunning, MapR
31. Engage!
• Follow @ApacheDrill on Twitter
• Sign up for the mailing lists (user | dev)
http://incubator.apache.org/drill/mailing-lists.html
• Standing G+ hangouts every Tuesday at 18:00 CET
• Keep an eye on http://drill-user.org/