Impala: Real-time Queries in Hadoop
2. Why Data Scientists Love Hadoop
• Massive volumes of data
• Data preparation & analytics in 1 environment
• Highly flexible environment for creating & testing machine learning models
• 10% the cost/TB under management
©2012 Cloudera, Inc. All Rights Reserved.
3. Hadoop Use Cases Moving to Real-Time
• Already query Hadoop using Hive
• Already load data into CDH every 90 mins or less
• Already use HBase for real-time data access
Source: Cloudera customer survey, August 2012
4. But Hadoop Isn't Fast Enough
• Need faster queries on Hadoop data
• Move data from Hadoop to RDBMS for interactive SQL
• See value today in consolidating to a single platform
Source: Cloudera customer survey, August 2012
5. Beyond Batch – The Next Stage for Hadoop
HADOOP TODAY IS TOO SLOW
• MapReduce is batch
• Simple queries can take minutes / tens of minutes
CURRENT DATA MANAGEMENT IS TOO COMPLEX
• Optimized for rigid schemas & special purpose applications
• Redundant data storage & processes
• Very expensive systems: $20K-150K / TB
6. Cloudera Enterprise RTQ
Real-Time Query for Data Stored in Hadoop
• Powered by Cloudera Impala
• Supports Hive SQL
• 4-30X faster than Hive over MapReduce
• Supports multiple storage engines & file formats
• Uses existing drivers, integrates with existing metastore, works with leading BI tools
• Flexible, cost-effective, no lock-in
• Deploy & operate with Cloudera Manager
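To make the "4-30X faster than Hive over MapReduce" claim concrete, the short sketch below converts a Hive runtime into the range implied by those two bounds. Only the 4x and 30x factors come from the slide; the function name and the 10-minute baseline are hypothetical values chosen for illustration, not measurements.

```python
# Illustrative arithmetic only: the 4x and 30x bounds come from the slide;
# the 10-minute Hive baseline below is a made-up example value.
def impala_runtime_range(hive_seconds, min_speedup=4.0, max_speedup=30.0):
    """Return (fastest, slowest) runtimes in seconds implied by the speedup range."""
    return hive_seconds / max_speedup, hive_seconds / min_speedup


fastest, slowest = impala_runtime_range(600.0)  # hypothetical 10-minute Hive query
print(f"{fastest:.0f}s to {slowest:.0f}s")  # prints "20s to 150s"
```

Under these assumptions, a query in the "tens of minutes" class would still take minutes at the low end of the range, so where a workload lands in the 4-30X span matters for interactive use.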
7. Cloudera Now Powered by Impala
[Diagram: BEFORE IMPALA vs. WITH IMPALA; layers: USER INTERFACE, BATCH PROCESSING, REAL-TIME ACCESS]
• Unified Storage:
– Supports HDFS and HBase
– Flexible file formats
• Unified Metastore
• Unified Security
• Unified Client Interfaces:
– ODBC, SQL syntax, Hue Beeswax
• With Impala:
– Real-time SQL queries
– Native distributed query engine
– Optimized for low-latency
• Provides:
– Answers as fast as you can ask
– Everyone to ask questions for all data
– Big data storage and analytics together
8. Impala beta features
Today (Cloudera Impala 0.1):
• Nearly all of Hive's SQL, including insert, join, and subqueries
• Query results 4-30X faster than Hive
• Same open Hive metadata model => easy to create & change schema
• Support for HDFS and HBase storage
• HDFS file formats: TextFile, SequenceFile
• HDFS compression: Snappy, GZIP, BZIP
• Common ODBC driver and Hue Beeswax with Hive
• CLI separate from Hive's
Next few months:
• Support for Avro, RCFile & LZO compressed text
• Additional OS support
• Trevni columnar format
• JDBC driver
• DDL
• Straggler handling
• Increased join perf
9. Impala v0.1 SQL (HiveQL)
• Select
– Boolean, tinyint, smallint, int, bigint, float, double, timestamp, string
– All, distinct
– Subqueries (in from clause)
– Where, group by, having
– Order by (with limit initially)
– Joins (left, right, full, outer), multi-table, subquery
– Union all
– Limit
– External tables
– Relational, arithmetic, logical operators
– Math, collection, cast, date, conditional, string, timestamp built-ins (e.g. count, sum, cast, case, like, in, between, coalesce)
• Insert into
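The constructs listed above can be seen together in one query. The sketch below is only an illustration of the query shape: it runs against an in-memory SQLite database from Python's standard library as a stand-in, not against Impala or Hive, and the `customers` and `orders` tables and their contents are hypothetical. HiveQL and SQLite differ in details, so treat it as a rough guide to the listed features (subquery in the from clause, join, where, group by, having, order by with limit, and the `coalesce`, `case`, `between`, `count`, and `sum` built-ins).

```python
import sqlite3

# SQLite stands in for a SQL engine here; nothing below talks to Impala.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL, placed TEXT);
    INSERT INTO customers VALUES
        (1, 'acme', 'east'), (2, 'globex', NULL), (3, 'initech', 'west');
    INSERT INTO orders VALUES
        (1, 120.0, '2012-08-01'), (1, 80.0, '2012-08-15'),
        (2, 40.0, '2012-09-02'), (3, 15.0, '2012-07-20');
""")

query = """
    SELECT c.name,
           COALESCE(c.region, 'unknown') AS region,
           COUNT(*) AS n_orders,
           SUM(o.amount) AS total,
           CASE WHEN SUM(o.amount) >= 100 THEN 'large' ELSE 'small' END AS tier
    FROM (SELECT customer_id, amount
          FROM orders
          WHERE placed BETWEEN '2012-08-01' AND '2012-09-30') o  -- subquery in FROM
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.name, c.region
    HAVING SUM(o.amount) > 20
    ORDER BY total DESC
    LIMIT 10  -- ORDER BY is paired with LIMIT, as the slide notes for v0.1
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the sample data, the July order falls outside the `between` range, so only two customers appear in the result.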
10. Cloudera Impala Details
[Architecture diagram: SQL App connects via ODBC; shared services: Hive Metastore, YARN, HDFS NN, State Store; three nodes, each running Query Planner, Query Coordinator, and Query Exec Engine on top of HDFS DN and HBase. Callouts: Common Hive SQL and interface; Unified metadata and scheduler; Fully MPP Distributed; Local Direct Reads]
11. Cloudera Impala Details
[Architecture diagram: SQL App connects via ODBC; shared services: Hive Metastore, YARN, HDFS NN, State Store; three nodes, each running Query Planner, Query Coordinator, and Query Exec Engine on top of HDFS DN and HBase. Callouts: Common Hive SQL and interface; SQL Request]
12. Cloudera Impala Details
[Architecture diagram: SQL App connects via ODBC; shared services: Hive Metastore, YARN, HDFS NN, State Store; three nodes, each running Query Planner, Query Coordinator, and Query Exec Engine on top of HDFS DN and HBase. Callout: Unified metadata and scheduler]
13. Cloudera Impala Details
[Architecture diagram: SQL App connects via ODBC; shared services: Hive Metastore, YARN, HDFS NN, State Store; three nodes, each running Query Planner, Query Coordinator, and Query Exec Engine on top of HDFS DN and HBase. Callout: Fully MPP Distributed]
14. Cloudera Impala Details
[Architecture diagram: SQL App connects via ODBC; shared services: Hive Metastore, YARN, HDFS NN, State Store; three nodes, each running Query Planner, Query Coordinator, and Query Exec Engine on top of HDFS DN and HBase. Callout: Local Direct Reads]
15. Cloudera Impala Details
[Architecture diagram: SQL App connects via ODBC; shared services: Hive Metastore, YARN, HDFS NN, State Store; three nodes, each running Query Planner, Query Coordinator, and Query Exec Engine on top of HDFS DN and HBase. Callouts: SQL Results; In Memory Transfers]
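The flow across these diagram builds (a SQL request arrives, each node's exec engine reads its local data, intermediate results move in memory, and results return to the client) resembles a scatter-gather aggregation. The following is a toy sketch of that pattern in a single Python process, not Impala's implementation: the function names, the per-node row lists, and the sum-by-key query are all hypothetical.

```python
from collections import Counter

# Hypothetical data: each inner list stands for the rows stored on one node.
NODE_LOCAL_ROWS = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 1), ("north", 4), ("east", 2)],
]


def exec_engine_partial_sum(local_rows):
    """Per-node step: aggregate only the rows that this node holds locally."""
    partial = Counter()
    for key, value in local_rows:
        partial[key] += value
    return partial


def coordinator_merge(partials):
    """Coordinator step: combine the per-node partial sums in memory."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)


partials = [exec_engine_partial_sum(rows) for rows in NODE_LOCAL_ROWS]
result = coordinator_merge(partials)
print(result)
```

The point of the split is that only small partial aggregates cross node boundaries, while the raw rows are read where they are stored.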
16. Impala and Hive
• Shared with Hive:
– Metadata (table definitions)
– ODBC driver
– Hue Beeswax
– SQL syntax (HiveQL)
– Flexible file formats
– Machine pool
• Improvements:
– Purpose-built query engine direct on HDFS and HBase
– No JVM and MapReduce
– In-memory data transfers
– Low-latency scheduler
– Native distributed relational query engine
– Trevni columnar format (after v0.1)
17. Advantages of Our Approach
• No high-latency MapReduce batch processing
• Local processing avoids network bottlenecks
• No costly data format conversion overhead
• All data immediately query-able
• Single machine pool to scale
• All machines available to both Impala and MapReduce
• Single, open, and unified metadata and scheduler
[Diagram comparing approaches; panel titles: MapReduce, Remote Query, Side Storage; component labels: Query Node, Query Engine, Hive, MR, NN, DN, HDFS]
18. Google Dremel and Impala
• What is Dremel:
– Columnar storage for data with nested structures
– Distributed scalable aggregation on top of that
• Columnar storage in Hadoop: Trevni
– New columnar format created by Doug Cutting
– Stores data in appropriate native/binary types
– Will also store nested structures similar to Dremel's ColumnIO
• Distributed aggregation: Impala
• Impala plus Trevni: a superset of the published version of Dremel (which didn't support joins)
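To illustrate why columnar storage suits aggregation, the sketch below pivots a few row-oriented records into one list per field and then sums a single column. It is a minimal in-memory illustration under stated assumptions: the records and field names are hypothetical, it ignores nested structures, and it does not reflect the actual on-disk layout of Trevni or Dremel's ColumnIO.

```python
# Hypothetical records; the field names are invented for illustration.
rows = [
    {"user": "a", "country": "US", "bytes": 120},
    {"user": "b", "country": "DE", "bytes": 300},
    {"user": "c", "country": "US", "bytes": 80},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)
# An aggregate over one field touches only that field's values.
total_bytes = sum(columns["bytes"])
print(total_bytes)  # prints 500
```

In a row layout the same sum would have to step through every field of every record; with columns, the unused `user` and `country` values are never read.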
19. Benefits of Cloudera Impala
Real-Time Query for Data Stored in Hadoop
• Get answers as fast as you can ask questions
• Interactive analytics directly on source data
• No jumping between data silos
• Reduce duplicate storage with EDW
• Reduce data movement for interactive analysis
• Leverage existing tools and employee skills
• Ask questions of all your data
• No information loss from aggregation or conforming to relational schemas for analysis
• Single metadata store from origination through analysis
• No need to hunt through multiple data silos