Performance evaluation of Cloudera Impala (with Comparison to Hive)
1. Cloudera Impala Performance Evaluation (with Comparison to Hive)
Dec. 8, 2012
CELLANT Corp. R&D Strategy Division
Yukinori SUDA
@sudabon
2. About Cloudera Impala
• Latest version is 0.3 beta
• Open-source implementation inspired by Google Dremel and F1
• Developed by Cloudera, a well-known Hadoop distributor
• Brings real-time, ad-hoc query capability to Apache Hadoop
• Queries data stored in HDFS or Apache HBase
• Uses the same metadata and SQL syntax (HiveQL) as Apache Hive
• Supports TextFile and SequenceFile as Hive storage formats
• Also supports SequenceFile compressed with Snappy, Gzip and Bzip
• Accesses the data directly through a specialized distributed query engine
3. Architecture
• The State Store runs as the impala-state-store (statestored) daemon
• The Query Planner, Query Coordinator and Query Exec Engine run inside the impalad daemon
4. System Environment
• Install via Cloudera Manager Free Edition
• Master (1 server)
o HDFS: NameNode, SecondaryNameNode
o MapReduceV1: JobTracker
o impala: impalad, impala-state-store (statestored)
• Slave (13 servers)
o HDFS: DataNode
o MapReduceV1: TaskTracker
o impala: impalad
• All servers are connected with 1Gbps Ethernet through an L2 switch
5. Server Specification
• CPU
o Intel Core 2 Duo 2.13 GHz with Hyper Threading
• Memory
o 4GB
• Disk
o 7,200 rpm SATA mechanical Hard Disk Drive
• OS
o CentOS 6.2
6. Benchmark
• Used CDH4.1 with Impala versions 0.2 and 0.3
• Used hivebench from the open-source benchmark tool “HiBench”
o https://github.com/hibench
• Scaled the datasets down to 1/10
o The default configuration generates a table with 1 billion rows
• Modified the query
o Removed “INSERT INTO TABLE …” to evaluate read-only performance
o Removed the “datediff” function (I mistakenly thought it was not supported)
• Combined several Hive storage formats with several compression methods
o TextFile, SequenceFile, RCFile
o No compression, Gzip, Snappy
• Compared query job latency
o Average job latency over 5 measurements
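The measurement procedure described above (each storage format paired with each compression method, latency averaged over 5 runs) could be sketched roughly as follows. This is only an illustration under assumptions: `run_query` is a hypothetical callable that submits one query and blocks until it finishes, and the slides do not say which pairs were actually run on each engine or how the timings were collected.

```python
import itertools
import statistics
import time
from typing import Callable, List, Tuple

# Storage formats and compression methods listed on the slide.
STORAGE_FORMATS = ["TextFile", "SequenceFile", "RCFile"]
COMPRESSIONS = ["No compression", "Gzip", "Snappy"]


def average_latency(run_query: Callable[[], None], runs: int = 5) -> float:
    """Return the mean wall-clock latency in seconds over `runs` executions."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # hypothetical: submit the query and wait for completion
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)


def benchmark_matrix() -> List[Tuple[str, str]]:
    """Every (storage format, compression) pair that could be measured."""
    return list(itertools.product(STORAGE_FORMATS, COMPRESSIONS))


if __name__ == "__main__":
    for fmt, codec in benchmark_matrix():
        # Placeholder workload; replace with a real Hive or Impala invocation.
        latency = average_latency(lambda: time.sleep(0.01))
        print(f"{fmt:12s} {codec:15s} {latency:.3f} s")
```

Averaging several runs smooths out one-off effects such as cold caches or scheduling jitter; the sleep call is only a stand-in so the sketch runs without a cluster.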
7. Modified Datasets
• Uservisits table
o 100 million rows
o Schema
• sourceIP string
• destURL string
• visitDate string
• adRevenue double
• userAgent string
• countryCode string
• languageCode string
• searchWord string
• duration int
• Rankings table
o 12 million rows
o Schema
• pageURL string
• pageRank int
• avgDuration int
8. Modified Query
SELECT
  sourceIP,
  sum(adRevenue) as totalRevenue,
  avg(pageRank)
FROM
  rankings R
JOIN (
  SELECT
    sourceIP,
    destURL,
    adRevenue
  FROM
    uservisits UV
  WHERE
    UV.visitDate >= '1999-01-01'
    AND UV.visitDate <= '2001-01-01'
) NUV
ON
  (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
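To make clear what the query computes, here is a minimal sketch that runs the same join, filter, aggregation and top-1 selection on a few rows. It rests on assumptions: SQLite (from the Python standard library) stands in for Hive or Impala, the filter column is taken to be `visitDate` as in the schema, only the columns the query touches are created in `uservisits`, and all rows are invented purely for illustration.

```python
import sqlite3

QUERY = """
SELECT
  sourceIP,
  sum(adRevenue) AS totalRevenue,
  avg(pageRank)
FROM rankings R
JOIN (
  SELECT sourceIP, destURL, adRevenue
  FROM uservisits UV
  WHERE UV.visitDate >= '1999-01-01'
    AND UV.visitDate <= '2001-01-01'
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
"""


def run_toy_query():
    """Run the benchmark query on a tiny in-memory dataset (illustrative rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER)"
    )
    # Subset of the uservisits schema: only the columns the query uses.
    conn.execute(
        "CREATE TABLE uservisits "
        "(sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL)"
    )
    conn.executemany(
        "INSERT INTO rankings VALUES (?, ?, ?)",
        [("a.example", 10, 5), ("b.example", 20, 7)],
    )
    conn.executemany(
        "INSERT INTO uservisits VALUES (?, ?, ?, ?)",
        [
            ("10.0.0.1", "a.example", "2000-05-01", 1.5),
            ("10.0.0.1", "b.example", "2000-06-01", 2.5),
            ("10.0.0.2", "a.example", "2000-07-01", 3.0),
            ("10.0.0.2", "b.example", "2002-01-01", 9.0),  # outside the date range
        ],
    )
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    print(run_toy_query())
```

The query returns the single source IP with the highest total ad revenue among visits in the date window, together with the average page rank of the pages it visited. The visit dated 2002 is filtered out before the join, so it does not contribute to any total.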
12. Conclusion
• Impala is over 10 times faster than MR + Hive
o Impala 0.3
• SequenceFile compressed as Snappy: 14.337 seconds
o Impala 0.2
• SequenceFile compressed as Gzip: 19.733 seconds
o Hive
• RCFile compressed as Snappy: 164.161 seconds
• Hope that Impala version 1.0, to be included in CDH5, will be even faster
o Support for RCFile and the Trevni columnar format
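The speedup behind the headline can be checked with simple arithmetic on the three latencies reported above. The sketch below assumes that "times faster" means the Hive latency divided by the Impala latency; the constant and function names are illustrative only.

```python
# Latencies in seconds, copied from the conclusion slide.
IMPALA_03_SEQ_SNAPPY = 14.337
IMPALA_02_SEQ_GZIP = 19.733
HIVE_RC_SNAPPY = 164.161


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    s03 = speedup(HIVE_RC_SNAPPY, IMPALA_03_SEQ_SNAPPY)
    s02 = speedup(HIVE_RC_SNAPPY, IMPALA_02_SEQ_GZIP)
    print(f"Impala 0.3 vs Hive: {s03:.2f}x")
    print(f"Impala 0.2 vs Hive: {s02:.2f}x")
```

With these figures the ratios come out to about 11.45x for Impala 0.3 and about 8.32x for Impala 0.2, so the "over 10 times" figure corresponds to the 0.3 result.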