Cloudera Impala - San Diego Big Data Meetup August 13th 2014

Cloudera Impala
SD Big Data Monthly Meetup #2
August 13th 2014

Maxime Dumas
Systems Engineer

Thirty Seconds About Max

•  Systems Engineer
•  aka Sales Engineer
•  SoCal, AZ, NV
•  former coder of PHP
•  teaches meditation + yoga
•  from Montreal, Canada

What Does Cloudera Do?

•  product
    •  distribution of Hadoop components, Apache licensed
    •  enterprise tooling
•  support
•  training
•  services (aka consulting)
•  community

What This Talk Isn’t About

•  deploying
    •  Puppet, Chef, Ansible, homegrown scripts, intern labor
•  sizing & tuning
    •  depends heavily on data and workload
•  coding
    •  unless you count XML or CSV or SQL
•  algorithms

What is Cloudera Impala?

cloud·e·ra im·pal·a
/kloudˈi(ə)rə imˈpalə/
noun

a modern, open source, MPP SQL query engine for Apache Hadoop.

“Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”

The Apache Hadoop Ecosystem
Quick and dirty, for context.

Why “Ecosystem?”

•  In the beginning, just Hadoop
    •  HDFS
    •  MapReduce
•  Today, dozens of interrelated components
    •  I/O
    •  Processing
    •  Specialty Applications
    •  Configuration
    •  Workflow

HDFS

•  Distributed, highly fault-tolerant filesystem
•  Optimized for large streaming access to data
•  Based on Google File System
    •  http://research.google.com/archive/gfs.html

Lots of Commodity Machines

Image: Yahoo! Hadoop cluster [ OSCON ’07 ]

MapReduce (MR)

•  Programming paradigm
•  Batch oriented, not realtime
•  Works well with distributed computing
•  Lots of Java, but other languages supported
•  Based on Google’s paper
    •  http://research.google.com/archive/mapreduce.html
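To make the paradigm concrete, here is a minimal single-process sketch of the classic word count in Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the shuffle that a real framework performs across machines is simulated with an in-memory dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data meetup"]))
```

Because each map call sees one line and each reduce call sees one key, both phases can be spread over many machines, which is why the model works well with distributed computing and why it is batch oriented rather than realtime.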

Apache Hive

•  Abstraction of Hadoop’s Java API
•  HiveQL “compiles” down to MR
    •  a “SQL-like” language
•  Eases analysis using MapReduce

Apache Hive Metastore

•  Maps HDFS files to DB-like resources
    •  Databases
    •  Tables
    •  Column/field names, data types
    •  Roles/users
    •  InputFormat/OutputFormat
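One way to picture that mapping is a small record that ties a table name to a directory of files plus a column list, which is then applied to raw text at read time. The sketch below is purely illustrative: `TableDef`, `read_row`, the `web.visits` table, and its path are hypothetical names, not the real Hive Metastore API or schema.

```python
from dataclasses import dataclass


@dataclass
class TableDef:
    """Hypothetical stand-in for one metastore entry (not the real Hive API)."""
    database: str
    name: str
    location: str          # HDFS directory that holds the table's files
    columns: list          # (column name, type name) pairs, in file order
    delimiter: str = ","


# Type names mapped to Python conversions; a tiny made-up subset.
CASTS = {"int": int, "double": float, "string": str}


def read_row(table, raw_line):
    """Apply the table definition to one raw text line at read time."""
    fields = raw_line.rstrip("\n").split(table.delimiter)
    return {name: CASTS[type_name](value)
            for (name, type_name), value in zip(table.columns, fields)}


visits = TableDef(
    database="web",
    name="visits",
    location="/data/web/visits",   # made-up path
    columns=[("user_id", "int"), ("page", "string"), ("seconds", "double")],
)

if __name__ == "__main__":
    print(read_row(visits, "42,/home,3.5\n"))
```

The files themselves stay plain; only the definition gives them database, table, and column structure, so several engines can share the same definition over the same files.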

Sqoop

•  SQL to Hadoop
•  Tool to import/export any JDBC-supported database into Hadoop
•  Transfer data between Hadoop and external databases or EDW
•  High performance connectors for some RDBMS
    •  Oracle, Teradata, Netezza
•  Developed at Cloudera

Cloudera Impala
Familiar interface, but more powerful.

Cloudera Impala

Interactive SQL for Hadoop
§  Responses in seconds
§  Nearly ANSI-92 standard SQL with Hive SQL

Native MPP Query Engine
§  Purpose-built for low-latency queries
§  Separate runtime from MapReduce
§  Designed as part of the Hadoop ecosystem

Open Source
§  Apache-licensed

Benefits of Impala

More & Faster Value from “Big Data”
§  Interactive BI/Analytics experience via SQL
§  No delays from data migration

Flexibility
§  Query across existing data
§  Select best-fit file formats (Parquet, Avro, etc.)
§  Run multiple frameworks on the same data at the same time

Cost Efficiency
§  Reduce movement, duplicate storage & compute
§  10% to 1% the cost of analytic DBMS

Full Fidelity Analysis
§  No loss from aggregations or fixed schemas

Impala Use Cases

Cost-effective, ad hoc query environment that offloads the data warehouse for:
§  Interactive BI/analytics on more data
§  Asking new questions: exploration, ML
§  Data processing with tight SLAs
§  Query-able archive w/full fidelity

Our Design Strategy

An Integrated Part of the Hadoop System
§  One pool of (open) data
§  One metadata model
§  One security framework
§  One set of system resources

[Diagram: the Hadoop stack]
Engines: Batch Processing (MapReduce, Hive & Pig); Interactive SQL (Cloudera Impala); Interactive Search (Cloudera Search); Machine Learning (Mahout, ClouderaML, Oryx); Math & Statistics (SAS, R); In-Memory Processing & Streaming (Spark); …
Shared layers: Resource Management; Metadata; Security; Storage Integration
Storage: HDFS (Text, RCFile, Parquet, Avro, etc.); HBase (records)

Impala Key Features

Fast
§  In-memory data transfers
§  Partitioned joins
§  Fully distributed aggregations

Flexible
§  Query data in HDFS & HBase
§  Supports multiple file formats & compression algorithms
§  Java & Native UDFs, UDAFs

Secure
§  Integrated with Hadoop security
§  Kerberos authentication
§  Authorization (Sentry)

Easy to Implement
§  Leverages Hive’s ODBC/JDBC connectors, metastore & SQL syntax
§  Open source

Easy to Use
§  Interact with data via SQL
§  Certified with leading BI tools

Simple to Manage
§  Deploy, configure & monitor with Cloudera Manager
§  Integrated with Hadoop resource management
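As a rough illustration of what a partitioned join means, the Python sketch below hash-partitions both inputs on the join key so that matching rows land on the same (simulated) node, then runs an ordinary hash join on each node. This is a generic textbook technique, not Impala's actual implementation; the function names and the three-node default are assumptions made for the example.

```python
from collections import defaultdict


def partition(rows, key, num_nodes):
    """Send each row to the node chosen by hashing its join key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes


def local_hash_join(left, right, key):
    """Hash join on one node: build a table from the left side, probe with the right."""
    built = defaultdict(list)
    for row in left:
        built[row[key]].append(row)
    return [{**l, **r} for r in right for l in built.get(r[key], [])]


def partitioned_join(left, right, key, num_nodes=3):
    """Partition both inputs the same way, then join each pair of partitions locally."""
    left_parts = partition(left, key, num_nodes)
    right_parts = partition(right, key, num_nodes)
    joined = []
    for l_part, r_part in zip(left_parts, right_parts):
        joined.extend(local_hash_join(l_part, r_part, key))
    return joined


if __name__ == "__main__":
    users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    orders = [{"id": 1, "total": 10}, {"id": 1, "total": 5}, {"id": 3, "total": 7}]
    print(partitioned_join(users, orders, "id"))
```

Since rows with the same key always hash to the same node, no node needs to see the whole of either table, and the per-node joins can run in parallel.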

What’s Coming?*

In the Near Term:
§  SQL 2003-Compliant Analytic Window Functions
§  Additional Authentication Mechanisms
§  User Defined Table Functions
§  Intra-node Parallelized Aggregations & Joins
§  Nested Data
§  Enhanced YARN-Integrated Resource Manager
§  Dynamic Partition Pruning

*On the roadmap… no guarantees

Impala Plays Well with Others

Building on the Enterprise Standard
BI Partners: [partner logos]
Powered by Impala: [logos]

Not All SQL On Hadoop Is Created Equal

Batch MapReduce
§  Make MapReduce faster
§  Slow, still batch

Remote Query
§  Pull data from HDFS over the network to the DW compute layer
§  Slow, expensive

Siloed DBMS
§  Load data into a proprietary database file
§  Rigid, siloed data, slow ETL

Impala
§  Native MPP query engine that’s integrated into Hadoop
§  Fast, flexible, cost-effective

[Diagram labels: $, DBMS, Hadoop]

More Detail On Alternative Approaches

Batch MapReduce
§  Batch-oriented
§  High latency

Remote Query
§  Network bottleneck
§  2x the hardware
§  Duplicate metadata, security, SQL, etc.
[Diagram labels: Hadoop, DBMS, HDFS Storage, Compute, Compute]

Siloed DBMS
§  RDBMS rigidity
§  Query subset of data
§  Duplicate storage, metadata, security, SQL, etc.
[Diagram labels: a PROPRIETARY DBMS side (DBMS, DBMS Metadata, Security) next to a STANDARD & SHARED Hadoop side (Engines: MapReduce, Hive, Pig, Impala, etc.; Hadoop Metadata; Resource Management; Storage Integration; Storage (HDFS); Security)]

Other Sexy New Big Data MPP Tools

Presto
§  Purpose-Built MPP Engine; Similar Architecture to Impala; Few Performance Comparisons, but Impala Anecdotally 5x-10x Faster

Shark
§  Hive-Compatible Data Warehouse for Spark; Great Performance until Required to go to Disk, at Which Point Impala Better; With HDFS Caching Impala will Perform on Par from a Memory Perspective

Drill
§  Open Source version of Dremel; Another MPP Engine; Multiple Data Formats and Sources

Phoenix (Sort Of)
§  SQL Skin over HBase (and Only HBase); Subset of SQL Standard

What About an EDW/RDBMS?

“Right Tool for the Right Job”

EDW/RDBMS Great For:
•  OLTP’s complex transactions
•  Highly planned and optimized known workloads
•  Operational reports and repeated known queries

Impala Great For:
•  Exploratory analytics with previously-unknown queries
•  Queries on big and growing data sets

EDW/RDBMS Can’t:
•  Dump in raw data then later define schema and query what you want
•  Evolve schemas without an expensive schema upgrade planning process
•  Simply scale just by adding industry-standard servers
•  Store at < $1k/TB instead of $10-150k/TB

Impala Technical Details

The Impala Advantage
Where does the Performance Come From?

§  No MapReduce; No JVM; All Native
§  In-Memory Data Transfers
§  Saturate Disks on Reads
§  Optimized File Format (i.e. Parquet)
§  In-Memory HDFS Caching
§  Cost-Based Join Order Optimization: Frees User from Having to Guess the Correct Join Order

Impala and Hive

Shares Everything Client-Facing
§  Metadata (table definitions)
§  ODBC/JDBC drivers
§  SQL syntax (Hive SQL)
§  Flexible file formats
§  Machine pool
§  Hue GUI

But Built for Different Purposes
§  Hive: runs on MapReduce and ideal for batch processing
§  Impala: native MPP query engine ideal for interactive SQL

[Diagram: Hive (SQL Syntax + MapReduce Compute Framework, Batch Processing) and Impala (SQL Syntax + Compute Framework, Interactive SQL) share Resource Management, Metadata, and Storage Integration over HDFS (Text, RCFile, Parquet, Avro, etc.) and HBase (records)]

Impala Query Execution

[Diagram: a SQL App connects via ODBC to one of three nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore.]

1) Request arrives via ODBC/JDBC/HUE/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client

Parquet File Format

Open source, columnar Hadoop file format developed by Cloudera & Twitter

§  Limits the IO to only the data that is needed
§  Supports storing each column in a separate file
§  Saves space: columnar layout compresses better
§  Enables better scans: load only the columns that are needed
§  Supports index pages for fast lookup
§  Extensible value encodings

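A small Python sketch of why a columnar layout helps, under loud assumptions: the `ROWS` data and every function name are invented for illustration, and the run-length encoding shown is a simplified stand-in, not Parquet's actual on-disk encoding.

```python
# Made-up sample data in the usual row-oriented shape.
ROWS = [
    {"user_id": 1, "country": "US", "clicks": 3},
    {"user_id": 2, "country": "US", "clicks": 7},
    {"user_id": 3, "country": "CA", "clicks": 2},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, run_length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def sum_column(columns, name):
    """A scan that touches only the one column the query needs."""
    return sum(columns[name])


COLUMNS = to_columns(ROWS)

if __name__ == "__main__":
    print(sum_column(COLUMNS, "clicks"))
    print(run_length_encode(COLUMNS["country"]))
```

Summing `clicks` never reads `user_id` or `country`, which is the "load only the columns that are needed" point, and values of one column sitting next to each other tend to repeat, which is why the layout compresses better.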
Impala Performance Results

•  Impala’s Milestone in Jan 2014:
    •  Comparable commercial MPP DBMS speed
    •  Natively on Hadoop
•  Three Result Sets:
    •  Impala vs Hive 0.12 (Impala 6-70x faster)
    •  Impala vs “DBMS-Y” (Impala average of 2x faster)
    •  Impala scalability (Impala achieves linear scale)
•  Background
    •  20 pre-selected, diverse TPC-DS queries (modified to remove unsupported language)
    •  Sufficient data scale for realistic comparison (3 TB, 15 TB, and 30 TB)
    •  Realistic nodes (e.g. 8-core CPU, 96GB RAM, 12x2TB disks)
    •  Methodical testing (multiple runs, reviewed fairness for competition, etc)
    •  Details: http://blog.cloudera.com/blog/2014/01/impala-performance-dbms-class-speed/

Enough slides… DEMO TIME!

So What is Cloudera Impala?

What’s Next?

•  Download Hadoop!
    •  CDH available at www.cloudera.com
    •  Try it online: Cloudera Live
    •  Cloudera provides pre-loaded VMs
    •  http://tiny.cloudera.com/quickstartvm
•  Ride Impala!
    •  http://impala.io/

Special thanks:
SAN DIEGO BIG DATA

Questions?
Preferably related to the talk… or not.

Thank You!

Maxime Dumas
mdumas@cloudera.com

We’re hiring.
