Cloudera impala

Cloudera Impala

Real Time Query for HDFS and HBase

Beyond Batch
What is Impala
Capability
Architecture
Demo

Beyond Batch

For some things MapReduce is just too slow
Apache Hive:
  MapReduce execution engine
  High-latency, low throughput
  High runtime overhead
Google realized this early on
  Analysts wanted fast, interactive results

Dremel

Google paper (2010)
  “scalable, interactive ad-hoc query system for analysis of read-only nested data”
Columnar storage format
Distributed scalable aggregation
  “capable of running aggregation queries over trillion-row tables in seconds”
http://research.google.com/pubs/pub36632.html
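
To make the "columnar storage format" point concrete, here is a minimal sketch of why a column layout suits aggregation queries. The toy table, its field names, and the helper `to_columns` are illustrative assumptions; this is not Dremel's or Trevni's actual on-disk format.

```python
# Hypothetical toy table; field names are illustrative only.
rows = [
    {"user": "a", "country": "CH", "bytes": 120},
    {"user": "b", "country": "UK", "bytes": 300},
    {"user": "c", "country": "CH", "bytes": 80},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every full record is visited just to sum one field.
total_from_rows = sum(row["bytes"] for row in rows)

# Column layout: only the "bytes" column is read; the others are skipped.
total_from_columns = sum(columns["bytes"])
```

Both totals are the same; the difference is how much data has to be scanned to compute them, which matters when the table is far larger than this one.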

Impala: Goals

General-purpose SQL query engine for Hadoop
  For analytical and transactional workloads
  Support queries that take μs to hours
Run directly with Hadoop
  Collocated daemons
  Same file formats
  Same storage managers (NN, metastore)

Impala: Goals

High performance
  C++
  runtime code generation (LLVM)
  direct access to data (no MapReduce)
Retain user experience
  easy for Hive users to migrate
100% open-source

Impala: Capability

HiveQL (subset of SQL92)
  select, project, join, union, subqueries, aggregation, insert, order by (with limit)
  DDL
Directly queries data in HDFS & HBase
  Text files (compressed)
  Sequence files (snappy/gzip)
  Avro & Trevni GA features
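
The capability list pairs "order by" with a limit. One plausible reason, which is an assumption here and not stated on the slide, is that a limit lets each node keep only its own top N rows before a final merge. A minimal sketch of that idea, with hypothetical function names:

```python
import heapq


def order_by_limit(rows, key, limit):
    """Return the `limit` smallest rows by `key`, in ascending order."""
    return heapq.nsmallest(limit, rows, key=key)


def distributed_order_by_limit(partitions, key, limit):
    """Top-N over several partitions: local top-N per partition, then merge."""
    # Each partition (standing in for a node) keeps at most `limit` rows.
    local_results = [order_by_limit(part, key, limit) for part in partitions]
    # The coordinator merges the small local results and applies the limit again.
    merged = [row for part in local_results for row in part]
    return order_by_limit(merged, key, limit)


# Hypothetical data split across three partitions.
partitions = [[5, 1, 9], [3, 7], [2, 8, 4]]
top3 = distributed_order_by_limit(partitions, key=lambda x: x, limit=3)
```

No partition ever has to return more than `limit` rows, so the merge step stays small regardless of table size.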

Impala: Capability

Familiar and unified platform
  Uses Hive’s metastore
  Submit queries via ODBC | Beeswax Thrift API
Query is distributed to nodes with relevant data
Process-to-process data exchange
Kerberos authentication
No fault tolerance
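
"Query is distributed to nodes with relevant data" can be sketched as follows: each block is scanned on a node that already holds a replica, each node computes a partial aggregate, and the partial results are merged. The block placement, node names, and the least-loaded assignment rule are all hypothetical; Impala's actual scheduler is not described in these slides.

```python
from collections import Counter, defaultdict

# Hypothetical block placement: block id -> nodes holding a replica.
block_locations = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node1", "node3"],
    "blk_4": ["node2", "node3"],
}

# Hypothetical block contents: (country, bytes) records.
block_data = {
    "blk_1": [("CH", 10), ("UK", 5)],
    "blk_2": [("CH", 7)],
    "blk_3": [("UK", 3), ("DE", 4)],
    "blk_4": [("DE", 1), ("CH", 2)],
}


def assign_blocks(locations):
    """Assign each block to the least-loaded node that holds a replica."""
    assignment = defaultdict(list)
    for block in sorted(locations):
        node = min(locations[block], key=lambda n: (len(assignment[n]), n))
        assignment[node].append(block)
    return dict(assignment)


def local_aggregate(blocks):
    """Sum bytes per country over the blocks scanned on one node."""
    partial = Counter()
    for block in blocks:
        for country, nbytes in block_data[block]:
            partial[country] += nbytes
    return partial


def run_query():
    """Scan locally on each node, then merge the partial aggregates."""
    assignment = assign_blocks(block_locations)
    total = Counter()
    for blocks in assignment.values():
        total.update(local_aggregate(blocks))
    return assignment, dict(total)


assignment, totals = run_query()
```

Only the small per-node aggregates cross the network in this sketch; the raw blocks are read where they are stored.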

Impala: Performance

Greater disk throughput
  ~100MB/sec/disk
  I/O-bound workloads faster by 3-4x
Queries that require multiple map-reduce phases in Hive are significantly faster in Impala (up to 45x)
Queries that run against in-memory cached data see a significant speedup (up to 90x)

Impala: Architecture

impalad
  runs on every node
  handles client requests (ODBC, Thrift)
  handles query planning & execution
statestored
  provides name service
  metadata distribution
  used for finding data

Current limitations

Public Beta (available since 24 Oct 2012)
No SerDes
No User Defined Functions (UDFs)
Joins are done in memory space no larger than that of the smallest node
impalad daemons only read statestored metadata at startup
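
The join limitation is easier to see with a generic in-memory hash join. The slides do not say which join algorithm Impala uses, so the sketch below is an assumption chosen for illustration; `hash_join` and the sample tables are hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner-join two lists of dicts; the build side is held fully in memory."""
    # Build phase: the whole build side goes into a hash table,
    # so it must fit in the memory of the node doing the join.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # Probe phase: the other side is streamed row by row.
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            yield {**match, **row}


# Hypothetical tables.
users = [
    {"user_id": 1, "name": "ann"},
    {"user_id": 2, "name": "bob"},
]
clicks = [
    {"user_id": 1, "url": "/a"},
    {"user_id": 1, "url": "/b"},
    {"user_id": 3, "url": "/c"},
]

joined = list(hash_join(users, clicks, "user_id", "user_id"))
```

In a join of this shape, the table that is built in memory sets the limit, which is consistent with the slide's statement that joins are bounded by the smallest node's memory.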

Futures

GA Q1 2013
  DDL support (CREATE, ALTER)
  Rudimentary cost-based optimizer (CBO)
  Joins done in aggregate memory
  metadata distribution through statestored
Doug Cutting’s Trevni
  Columnar storage format like Dremel’s
  Impala + Trevni = Dremel superset

Demo

impala-user@cloudera.com
kinley@cloudera.com
@jrkinley
