James Kinley from Cloudera:
An introduction to Cloudera Impala. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
The link to the video: http://zurichtechtalks.ch/post/37339409724/an-introduction-to-cloudera-impala-sql-on-top-of
2. Beyond Batch
- What is Impala
- Capability
- Architecture
- Demo
3. Beyond Batch
- For some things MapReduce is just too slow
- Apache Hive:
  - MapReduce execution engine
  - High-latency, low throughput
  - High runtime overhead
- Google realized this early on
  - Analysts wanted fast, interactive results
4. Dremel
- Google paper (2010)
  - "scalable, interactive ad-hoc query system for analysis of read-only nested data"
- Columnar storage format
- Distributed scalable aggregation
  - "capable of running aggregation queries over trillion-row tables in seconds"
- http://research.google.com/pubs/pub36632.html
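To illustrate why a columnar storage format helps aggregation queries, here is a minimal Python sketch. The table, field names, and the `to_columns` helper are made up for illustration; Dremel's actual format also encodes nested and repeated fields, which this toy example ignores.

```python
# Toy comparison of a row-oriented and a column-oriented layout.
# The data and the to_columns helper are hypothetical, for illustration only.
rows = [
    {"user": "a", "country": "CH", "bytes": 120},
    {"user": "b", "country": "DE", "bytes": 300},
    {"user": "c", "country": "CH", "bytes": 80},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every full record is visited just to read one field.
total_from_rows = sum(record["bytes"] for record in rows)

# Column layout: only the "bytes" column is touched.
total_from_columns = sum(columns["bytes"])
```

Both totals are identical; the difference is that the columnar version never reads the `user` or `country` values, which is where the I/O saving comes from on wide tables.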
5. Impala: Goals
- General-purpose SQL query engine for Hadoop
  - For analytical and transactional workloads
  - Support queries that take μs to hours
- Run directly with Hadoop
  - Collocated daemons
  - Same file formats
  - Same storage managers (NN, metastore)
6. Impala: Goals
- High performance
  - C++
  - runtime code generation (LLVM)
  - direct access to data (no MapReduce)
- Retain user experience
  - easy for Hive users to migrate
- 100% open-source
7. Impala: Capability
- HiveQL (subset of SQL92)
  - select, project, join, union, subqueries, aggregation, insert, order by (with limit)
  - DDL
- Directly queries data in HDFS & HBase
  - Text files (compressed)
  - Sequence files (snappy/gzip)
  - Avro & Trevni (GA features)
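As a rough illustration of the query shape listed above (select, join, aggregation, order by with limit), the sketch below runs such a statement against Python's built-in `sqlite3` module. SQLite is only a stand-in so the example is runnable anywhere; it is not Impala, and the `users` and `visits` tables are invented for the example.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, country TEXT);
    CREATE TABLE visits (user_id INTEGER, bytes INTEGER);
    INSERT INTO users VALUES (1, 'CH'), (2, 'DE'), (3, 'CH');
    INSERT INTO visits VALUES (1, 120), (2, 300), (3, 80), (1, 50);
""")

# Join + aggregation + ORDER BY with a LIMIT, matching the slide's
# note that ordering is supported together with a limit.
query = """
    SELECT u.country, SUM(v.bytes) AS total_bytes
    FROM visits v
    JOIN users u ON u.id = v.user_id
    GROUP BY u.country
    ORDER BY total_bytes DESC
    LIMIT 10
"""
result = conn.execute(query).fetchall()
conn.close()
```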
8. Impala: Capability
- Familiar and unified platform
  - Uses Hive's metastore
  - Submit queries via ODBC | Beeswax Thrift API
- Query is distributed to nodes with relevant data
- Process-to-process data exchange
- Kerberos authentication
- No fault tolerance
9. Impala: Performance
- Greater disk throughput
  - ~100 MB/sec/disk
  - I/O-bound workloads faster by 3-4x
- Queries that require multiple map-reduce phases in Hive are significantly faster in Impala (up to 45x)
- Queries that run against in-memory cached data see a significant speedup (up to 90x)
10. Impala: Architecture
- impalad
  - runs on every node
  - handles client requests (ODBC, Thrift)
  - handles query planning & execution
- statestored
  - provides name service
  - metadata distribution
  - used for finding data
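The idea of sending work to the nodes that hold the relevant data and merging their partial results can be sketched as follows. This is a single-process simulation under stated assumptions: the node names, the `blocks_by_node` placement map, and both functions are hypothetical and do not reflect Impala's actual planner or wire protocol.

```python
from collections import defaultdict

# Hypothetical data placement: node name -> (country, bytes) rows stored locally.
blocks_by_node = {
    "node1": [("CH", 120), ("DE", 300)],
    "node2": [("CH", 80)],
    "node3": [("DE", 10), ("FR", 5)],
}


def local_aggregate(local_rows):
    """Work done by one node on its own data: a partial SUM per key."""
    partial = defaultdict(int)
    for key, value in local_rows:
        partial[key] += value
    return dict(partial)


def coordinate(placement):
    """Run the fragment on every node that holds data, then merge the partials."""
    merged = defaultdict(int)
    for local_rows in placement.values():
        for key, value in local_aggregate(local_rows).items():
            merged[key] += value
    return dict(merged)


totals = coordinate(blocks_by_node)
```

Only the small per-node partial sums cross the network in this model, not the raw rows, which is the point of running the daemons next to the data.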
15. Current limitations
- Public Beta (available since 24 Oct 2012)
- No SerDes
- No User Defined Functions (UDFs)
- Joins are done in memory space no larger than that of the smallest node
- impalads only read statestored metadata at startup
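The join limitation above can be pictured with a small in-memory join that refuses to run when one side does not fit a memory budget. The slides do not say which join algorithm Impala uses; this sketch assumes a simple hash join, and `hash_join`, `JoinTooLarge`, and the row-count budget are invented names for illustration.

```python
class JoinTooLarge(Exception):
    """Raised when the build side exceeds the (hypothetical) memory budget."""


def hash_join(build_rows, probe_rows, max_build_rows):
    """Join (key, value) pairs in memory, with a cap on the build side.

    max_build_rows stands in for the memory of the smallest node.
    """
    if len(build_rows) > max_build_rows:
        raise JoinTooLarge(
            f"build side has {len(build_rows)} rows, budget is {max_build_rows}"
        )

    # Build phase: index the smaller input by join key.
    table = {}
    for key, value in build_rows:
        table.setdefault(key, []).append(value)

    # Probe phase: stream the other input and emit matches.
    joined = []
    for key, value in probe_rows:
        for match in table.get(key, []):
            joined.append((key, match, value))
    return joined


users = [(1, "CH"), (2, "DE")]
visits = [(1, 120), (2, 300), (1, 50)]
joined = hash_join(users, visits, max_build_rows=10)
```

With `max_build_rows=1` the same call raises `JoinTooLarge`, mirroring a join whose in-memory side is larger than the smallest node can hold.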
16. Futures
- GA Q1 2013
  - DDL support (CREATE, ALTER)
  - Rudimentary cost-based optimizer (CBO)
  - Joins done in aggregate memory
  - Metadata distribution through statestored
- Doug Cutting's Trevni
  - Columnar storage format like Dremel's
  - Impala + Trevni = Dremel superset