Cloudera Impala - San Diego Big Data Meetup August 13th 2014
1. Cloudera Impala
SD Big Data Monthly Meetup #2
August 13th 2014
Maxime Dumas
Systems Engineer
2. Thirty Seconds About Max
• Systems Engineer
• aka Sales Engineer
• SoCal, AZ, NV
• former coder of PHP
• teaches meditation + yoga
• from Montreal, Canada
3. What Does Cloudera Do?
• product
• distribution of Hadoop components, Apache licensed
• enterprise tooling
• support
• training
• services (aka consulting)
• community
4. What This Talk Isn’t About
• deploying
• Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
• depends heavily on data and workload
• coding
• unless you count XML or CSV or SQL
• algorithms
9. Why “Ecosystem?”
• In the beginning, just Hadoop
• HDFS
• MapReduce
• Today, dozens of interrelated components
• I/O
• Processing
• Specialty Applications
• Configuration
• Workflow
10. HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
• http://research.google.com/archive/gfs.html
12. MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
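To make the MapReduce paradigm concrete, here is a minimal single-process sketch of a word count in Python. It only imitates the map, shuffle, and reduce stages in memory; the function names are illustrative assumptions and it does not use Hadoop's actual API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three stages in order on an in-memory list of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big meetup"]))
```

On a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network, which is part of why the model is batch oriented.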
13. Apache Hive
• Abstraction of Hadoop’s Java API
• HiveQL “compiles” down to MR
• a “SQL-like” language
• Eases analysis using MapReduce
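As a rough illustration of what "compiling" SQL down to MapReduce means, the sketch below computes the same grouped sum twice: once declaratively, and once as hand-written map, shuffle, and reduce steps. SQLite (from Python's standard library) stands in for a SQL engine here purely for convenience; it is not Hive, and the table and column names are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (channel, clicks) rows used only for this illustration.
ROWS = [("web", 3), ("mobile", 5), ("web", 4)]


def sql_totals(rows):
    """Declarative version: what an analyst writes in a SQL-like language."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clicks (channel TEXT, n INTEGER)")
    conn.executemany("INSERT INTO clicks VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT channel, SUM(n) FROM clicks GROUP BY channel"
    ).fetchall()
    conn.close()
    return dict(result)


def mapreduce_totals(rows):
    """Hand-written equivalent: the GROUP BY column becomes the map output
    key, and SUM becomes the reduce function."""
    groups = defaultdict(list)
    for channel, n in rows:        # map: emit (channel, n)
        groups[channel].append(n)  # shuffle: group values by key
    return {channel: sum(ns) for channel, ns in groups.items()}  # reduce


if __name__ == "__main__":
    print(sql_totals(ROWS) == mapreduce_totals(ROWS))
```

The point of the comparison is that the analyst only writes the query; generating the equivalent map and reduce logic is the job of the SQL layer.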
18. Cloudera Impala
Interactive SQL for Hadoop
• Responses in seconds
• Nearly ANSI-92 standard SQL with Hive SQL
Native MPP Query Engine
• Purpose-built for low-latency queries
• Separate runtime from MapReduce
• Designed as part of the Hadoop ecosystem
Open Source
• Apache-licensed
19. Benefits of Impala
More & Faster Value from “Big Data”
• Interactive BI/Analytics experience via SQL
• No delays from data migration
Flexibility
• Query across existing data
• Select best-fit file formats (Parquet, Avro, etc.)
• Run multiple frameworks on the same data at the same time
Cost Efficiency
• Reduce movement, duplicate storage & compute
• 10% to 1% the cost of analytic DBMS
Full Fidelity Analysis
• No loss from aggregations or fixed schemas
20. Impala Use Cases
Cost-effective, ad hoc query environment that offloads the data warehouse for:
• Interactive BI/analytics on more data
• Asking new questions – exploration, ML
• Data processing with tight SLAs
• Query-able archive w/full fidelity
21. Our Design Strategy
An Integrated Part of the Hadoop System
• One pool of (open) data
• One metadata model
• One security framework
• One set of system resources
[Diagram: the Hadoop stack]
• Engines: Batch Processing (MapReduce, Hive & Pig), Interactive SQL (Cloudera Impala), Interactive Search (Cloudera Search), Machine Learning (Mahout, ClouderaML, Oryx), Math & Statistics (SAS, R), In-Memory Processing & Streaming (Spark), …
• Storage: HDFS (Text, RCFile, Parquet, Avro, etc.), HBase (records)
• Shared layers: Integration, Resource Management, Metadata, Security
22. Impala Key Features
Fast
• In-memory data transfers
• Partitioned joins
• Fully distributed aggregations
Flexible
• Query data in HDFS & HBase
• Supports multiple file formats & compression algorithms
• Java & Native UDFs, UDAFs
Secure
• Integrated with Hadoop security
• Kerberos authentication
• Authorization (Sentry)
Easy to Implement
• Leverages Hive’s ODBC/JDBC connectors, metastore & SQL syntax
• Open source
Easy to Use
• Interact with data via SQL
• Certified with leading BI tools
Simple to Manage
• Deploy, configure & monitor with Cloudera Manager
• Integrated with Hadoop resource management
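The "partitioned joins" and "fully distributed aggregations" items can be pictured with the toy sketch below. It simulates nodes as Python lists inside one process, and the table contents and function names are assumptions made up for illustration; it shows the general technique, not Impala's implementation.

```python
from collections import defaultdict


def hash_partition(rows, key_index, num_nodes):
    """Route each row to a node by hashing its join key, so rows with the
    same key from both tables land on the same node."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(row[key_index]) % num_nodes].append(row)
    return partitions


def local_join(customers, orders):
    """Hash join on one node: build on customers, probe with orders."""
    build = defaultdict(list)
    for cust_id, name in customers:
        build[cust_id].append(name)
    return [
        (cust_id, name, amount)
        for cust_id, amount in orders
        for name in build.get(cust_id, ())
    ]


def partitioned_join_sum(customers, orders, num_nodes=3):
    """Total order amount per customer name, computed node by node."""
    cust_parts = hash_partition(customers, 0, num_nodes)
    order_parts = hash_partition(orders, 0, num_nodes)
    totals = defaultdict(int)
    for node in range(num_nodes):
        # Each node joins only its own partitions and pre-aggregates locally.
        partial = defaultdict(int)
        for _, name, amount in local_join(cust_parts[node], order_parts[node]):
            partial[name] += amount
        # Final step: merge the per-node partial sums.
        for name, subtotal in partial.items():
            totals[name] += subtotal
    return dict(totals)


if __name__ == "__main__":
    customers = [(1, "ann"), (2, "bob"), (3, "cy")]
    orders = [(1, 10), (2, 5), (1, 7), (4, 99)]
    print(partitioned_join_sum(customers, orders))
```

Because matching keys always hash to the same node, no node needs the whole of either table, and the answer does not depend on how many nodes are used.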
23. What’s Coming?*
In the Near Term:
• SQL 2003-Compliant Analytic Window Functions
• Additional Authentication Mechanisms
• User Defined Table Functions
• Intra-node Parallelized Aggregations & Joins
• Nested Data
• Enhanced YARN-Integrated Resource Manager
• Dynamic Partition Pruning
*On the roadmap… no guarantees
24. Impala Plays Well with Others
BI Partners: Building on the Enterprise Standard
POWERED BY IMPALA
25. Not All SQL On Hadoop Is Created Equal
• Batch MapReduce: Make MapReduce faster (slow, still batch)
• Remote Query: Pull data from HDFS over the network to the DW compute layer (slow, expensive)
• Siloed DBMS: Load data into a proprietary database file (rigid, siloed data, slow ETL)
• Impala: Native MPP query engine that’s integrated into Hadoop (fast, flexible, cost-effective)
26. More Detail On Alternative Approaches
Batch MapReduce
• Batch-oriented
• High latency
Remote Query
• Network bottleneck
• 2x the hardware
• Duplicate metadata, security, SQL, etc.
Siloed DBMS
• RDBMS rigidity
• Query subset of data
• Duplicate storage, metadata, security, SQL, etc.
[Diagrams: Remote Query shows a DBMS compute layer and Hadoop compute over HDFS storage; Siloed DBMS contrasts proprietary DBMS storage, metadata and security with Hadoop's standard & shared layers: storage (HDFS, HBase), integration, resource management, metadata, security, and engines (MapReduce, Hive, Pig, Impala, etc.)]
27. Other Sexy New Big Data MPP Tools
• Presto: Purpose-Built MPP Engine; Similar Architecture to Impala; Few Performance Comparisons, but Impala Anecdotally 5x-10x Faster
• Shark: Hive-Compatible Data Warehouse for Spark; Great Performance until Required to go to Disk, at Which Point Impala Better; With HDFS Caching Impala will Perform on Par from a Memory Perspective
• Drill: Open Source version of Dremel; Another MPP Engine; Multiple Data Formats and Sources
• Phoenix – Sort Of: SQL Skin over HBase (and Only HBase); Subset of SQL Standard
28. What About an EDW/RDBMS?
“Right Tool for the Right Job”
EDW/RDBMS Great For:
• OLTP’s complex transactions
• Highly planned and optimized known workloads
• Operational reports and repeated known queries
Impala Great For:
• Exploratory analytics with previously-unknown queries
• Queries on big and growing data sets
EDW/RDBMS Can’t:
• Dump in raw data then later define schema and query what you want
• Evolve schemas without an expensive schema upgrade planning process
• Simply scale just by adding industry-standard servers
• Store at < $1k/TB instead of $10-150k/TB
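The "dump in raw data then later define schema" point is often called schema-on-read. A minimal sketch of the idea follows, assuming a small comma-delimited raw file; the sample data, column names, and the `read_with_schema` helper are all hypothetical.

```python
import csv
import io

# Hypothetical raw file contents, stored as-is with no schema attached.
RAW = "2014-08-13,impala,42\n2014-08-13,hive,7\n"


def read_with_schema(raw_text, schema):
    """Apply a list of (column name, type) pairs to raw delimited text at
    query time. Fields beyond the end of the schema are simply ignored."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


if __name__ == "__main__":
    # First look at the data: only two columns are named, both as text.
    first_schema = [("day", str), ("engine", str)]
    print(list(read_with_schema(RAW, first_schema)))

    # Later, a third column is added to the schema with a numeric type.
    # The stored bytes are untouched; only the interpretation changes.
    later_schema = [("day", str), ("engine", str), ("hits", int)]
    print(sum(row["hits"] for row in read_with_schema(RAW, later_schema)))
```

The schema lives with the query rather than with the stored bytes, so changing it does not require reloading the data.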
30. The Impala Advantage
Where does the Performance Come From?
• No MapReduce; No JVM; All Native
• In-Memory Data Transfers
• Saturate Disks on Reads
• Optimized File Format (i.e. Parquet)
• In-Memory HDFS Caching
• Cost-Based Join Order Optimization – Frees User from Having to Guess the Correct Join Order
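To show why join order matters and what a cost-based choice looks like, here is a toy optimizer. The cost model (sum of estimated intermediate result sizes with one uniform selectivity), the table names, and the row counts are simplifying assumptions for illustration; this is not Impala's planner.

```python
from itertools import permutations


def plan_cost(order, row_counts, selectivity):
    """Estimated cost of a left-deep join order: the sum of the estimated
    sizes of every intermediate result."""
    rows = row_counts[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * row_counts[table] * selectivity
        cost += rows
    return cost


def best_join_order(row_counts, selectivity=0.001):
    """Try every order and keep the cheapest one under the toy cost model.
    Exhaustive search is only practical for a handful of tables."""
    return min(
        permutations(row_counts),
        key=lambda order: plan_cost(order, row_counts, selectivity),
    )


if __name__ == "__main__":
    # Hypothetical table statistics.
    counts = {"facts": 1_000_000, "dates": 365, "stores": 50}
    order = best_join_order(counts)
    print(order, plan_cost(order, counts, 0.001))
    print(plan_cost(("facts", "dates", "stores"), counts, 0.001))
```

Under this model the cheapest plans join the two small tables first and bring in the large table last, which keeps intermediate results small. An optimizer makes that decision from table statistics so the query author does not have to.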
31. Impala and Hive
Shares Everything Client-Facing
• Metadata (table definitions)
• ODBC/JDBC drivers
• SQL syntax (Hive SQL)
• Flexible file formats
• Machine pool
• Hue GUI
But Built for Different Purposes
• Hive: runs on MapReduce and ideal for batch processing
• Impala: native MPP query engine ideal for interactive SQL
[Diagram: shared Storage (HDFS: Text, RCFile, Parquet, Avro, etc.; HBase: records), Integration, Resource Management and Metadata layers; Batch Processing is Hive = SQL Syntax + MapReduce Compute Framework; Interactive SQL is Impala = SQL Syntax + Compute Framework]
33. Impala Query Execution
[Diagram: SQL App connecting via ODBC; Hive Metastore, HDFS NN and Statestore; three nodes, each running Query Planner, Query Coordinator and Query Executor alongside HDFS DN and HBase]
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
34. Impala Query Execution
[Diagram: SQL App connecting via ODBC; Hive Metastore, HDFS NN and Statestore; three nodes, each running Query Planner, Query Coordinator and Query Executor alongside HDFS DN and HBase]
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
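The planner, coordinator, and executor roles on these slides can be sketched as a small single-process simulation. It models a query along the lines of "count the rows above a threshold"; the node names, data, and function names are hypothetical, and real impalad processes communicate over the network rather than through function calls.

```python
from typing import Dict, Iterator, List

# Hypothetical cluster: node name -> values in the data blocks stored locally.
BLOCKS: Dict[str, List[int]] = {
    "node1": [3, 8, 1],
    "node2": [5, 9],
    "node3": [2, 7, 4],
}


def plan(threshold: int) -> List[dict]:
    """Planner: turn the request into one scan-and-count fragment per node
    that holds data."""
    return [{"node": node, "threshold": threshold} for node in BLOCKS]


def execute_fragment(fragment: dict) -> Iterator[int]:
    """Executor: scan only the local rows and stream a partial count back."""
    local_rows = BLOCKS[fragment["node"]]
    yield sum(1 for value in local_rows if value > fragment["threshold"])


def coordinate(threshold: int) -> int:
    """Coordinator: start each fragment where its data lives, merge the
    streamed partial results, and return the final answer to the client."""
    total = 0
    for fragment in plan(threshold):
        for partial in execute_fragment(fragment):
            total += partial
    return total


if __name__ == "__main__":
    print(coordinate(4))
```

Only the small partial counts travel between nodes; the raw rows stay where they are stored, which is the benefit of running fragments local to the data.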
35. Parquet File Format
Open source, columnar Hadoop file format developed by Cloudera & Twitter
• Limits the IO to only the data that is needed
• Supports storing each column in a separate file
• Saves space: columnar layout compresses better
• Enables better scans: load only the columns that are needed
• Supports index pages for fast lookup
• Extensible value encodings
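The "load only the columns that are needed" benefit can be illustrated by comparing a row-oriented layout with a column-oriented one. The sketch below uses JSON blobs as a stand-in for on-disk storage and made-up sample records; it does not read or write real Parquet files.

```python
import json

# Hypothetical records used only for this illustration.
ROWS = [
    {"id": 1, "engine": "impala", "latency_ms": 120},
    {"id": 2, "engine": "hive", "latency_ms": 45000},
    {"id": 3, "engine": "impala", "latency_ms": 95},
]


def to_row_layout(rows):
    """Row-oriented storage: one serialized blob per record."""
    return [json.dumps(row).encode() for row in rows]


def to_column_layout(rows):
    """Column-oriented storage: one serialized blob per column."""
    return {
        name: json.dumps([row[name] for row in rows]).encode()
        for name in rows[0]
    }


def sum_latency_row(row_blobs):
    """Every record must be read in full to reach a single field."""
    bytes_read = sum(len(blob) for blob in row_blobs)
    total = sum(json.loads(blob)["latency_ms"] for blob in row_blobs)
    return total, bytes_read


def sum_latency_column(column_blobs):
    """Only the blob for the needed column is read."""
    blob = column_blobs["latency_ms"]
    return sum(json.loads(blob)), len(blob)


if __name__ == "__main__":
    print(sum_latency_row(to_row_layout(ROWS)))
    print(sum_latency_column(to_column_layout(ROWS)))
```

Both functions return the same total, but the columnar version touches far fewer bytes. Keeping similar values next to each other is also what tends to help compression in a columnar layout.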