Performance evaluation of cloudera impala (with Comparison to Hive)
1. Cloudera Impala Performance Evaluation (with Comparison to Hive)
Dec. 8, 2012
CELLANT Corp. R&D Strategy Division
Yukinori SUDA
@sudabon
2. About Cloudera Impala
• Latest version is 0.3 beta
• Open-source implementation inspired by Google Dremel and F1
• Developed by Cloudera, a well-known Hadoop distributor
• Brings real-time, ad-hoc query capability to Apache Hadoop
• Queries data stored in HDFS or Apache HBase
• Uses the same metadata and SQL syntax (HiveQL) as Apache Hive
• Supports TextFile and SequenceFile as Hive storage formats
• Also supports SequenceFile compressed with Snappy, Gzip, or Bzip
• Accesses the data directly through a specialized distributed query engine
3. Architecture
• The State Store runs as the impala-state-store (statestored) daemon
• The Query Planner, Query Coordinator and Query Exec Engine run inside the impalad daemon
4. System Environment
• Installed via Cloudera Manager Free Edition
• Master (1 server)
o HDFS: NameNode, SecondaryNameNode
o MapReduceV1: JobTracker
o impala: impalad, impala-state-store (statestored)
• Slaves (13 servers)
o HDFS: DataNode
o MapReduceV1: TaskTracker
o impala: impalad
• All servers are connected with 1Gbps Ethernet through an L2 switch
5. Server Specification
• CPU
o Intel Core 2 Duo 2.13 GHz with Hyper Threading
• Memory
o 4GB
• Disk
o 7,200 rpm SATA mechanical Hard Disk Drive
• OS
o CentOS 6.2
6. Benchmark
• Used CDH4.1 with Impala versions 0.2 and 0.3
• Used hivebench from the open-source benchmark tool “HiBench”
o https://github.com/hibench
• Reduced the datasets to 1/10 scale
o The default configuration generates a table with 1 billion rows
• Modified the query
o Deleted “INSERT INTO TABLE …” to evaluate read-only performance
o Deleted the “datediff” function (I mistakenly thought it was not supported)
• Combined several Hive storage formats with several compression methods
o TextFile, SequenceFile, RCFile
o No compression, Gzip, Snappy
• Compared query job latency
o Average job latency over 5 measurements
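The measurement procedure above (every storage format paired with every compression method, latency averaged over 5 runs) can be sketched roughly as follows. This is a minimal sketch, not the harness used for the slides: `benchmark_matrix`, `average_latency` and the `run_query` callable are hypothetical names, and how a query is actually submitted to Hive or Impala is left to the caller. Not every combination is necessarily runnable on both engines.

```python
import itertools
import statistics
import time
from typing import Callable, List, Tuple

# Storage formats and compression methods listed on the Benchmark slide.
STORAGE_FORMATS = ["TextFile", "SequenceFile", "RCFile"]
COMPRESSIONS = ["No compression", "Gzip", "Snappy"]


def benchmark_matrix() -> List[Tuple[str, str]]:
    """Return every (storage format, compression) pair as a candidate test case."""
    return list(itertools.product(STORAGE_FORMATS, COMPRESSIONS))


def average_latency(run_query: Callable[[], object], runs: int = 5) -> float:
    """Call run_query `runs` times and return the mean wall-clock latency in seconds.

    run_query is a hypothetical callable that submits one query and blocks
    until the result is available.
    """
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)
```

A caller would loop over `benchmark_matrix()`, point the query at the table stored in that format, and record `average_latency(...)` for each engine.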
7. Modified Datasets
• Uservisits table
o 100 million rows
o Schema
    • sourceIP string
    • destURL string
    • visitDate string
    • adRevenue double
    • userAgent string
    • countryCode string
    • languageCode string
    • searchWord string
    • duration int
• Rankings table
o 12 million rows
o Schema
    • pageURL string
    • pageRank int
    • avgDuration int
8. Modified Query
SELECT
    sourceIP,
    sum(adRevenue) as totalRevenue,
    avg(pageRank)
FROM
    rankings R
JOIN (
    SELECT
        sourceIP,
        destURL,
        adRevenue
    FROM
        uservisits UV
    WHERE
        UV.visitDate >= '1999-01-01'
        AND UV.visitDate <= '2001-01-01'
) NUV
ON
    (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
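To make the logic of this query concrete, here is a small sketch that runs it on a handful of rows. It uses Python's built-in SQLite as a stand-in, which is an assumption for illustration only: it is not Impala or Hive, the toy tables keep only the columns the query touches, and the sample rows are invented. The query filters visits by date, joins them to rankings on the URL, sums ad revenue per source IP, and returns the top earner.

```python
import sqlite3

# The benchmark query, with the date column spelled as in the uservisits schema.
QUERY = """
SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
FROM rankings R
JOIN (
    SELECT sourceIP, destURL, adRevenue
    FROM uservisits UV
    WHERE UV.visitDate >= '1999-01-01' AND UV.visitDate <= '2001-01-01'
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
"""


def run_toy_example():
    """Run the query against a tiny in-memory dataset (invented sample rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER)")
    conn.execute("CREATE TABLE uservisits (sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL)")
    conn.executemany(
        "INSERT INTO rankings VALUES (?, ?, ?)",
        [("a.com", 10, 5), ("b.com", 20, 7)],
    )
    conn.executemany(
        "INSERT INTO uservisits VALUES (?, ?, ?, ?)",
        [
            ("1.1.1.1", "a.com", "2000-01-01", 1.0),
            ("1.1.1.1", "b.com", "2000-06-01", 2.0),
            ("2.2.2.2", "a.com", "2000-03-01", 5.0),
            ("2.2.2.2", "b.com", "2005-01-01", 100.0),  # outside the date range, filtered out
        ],
    )
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    print(run_toy_example())  # ('2.2.2.2', 5.0, 10.0)
```

The 100.0 visit is excluded by the date filter, so 2.2.2.2 wins with 5.0 of revenue rather than 105.0.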
12. Conclusion
• Impala is over 10 times faster than MR + Hive
o Impala 0.3
• SequenceFile compressed as Snappy: 14.337 seconds
o Impala 0.2
• SequenceFile compressed as Gzip: 19.733 seconds
o Hive
• RCFile compressed as Snappy: 164.161 seconds
• Hopefully Impala version 1.0, included in CDH5, will be even faster
o Support for RCFile and the Trevni columnar format
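The latencies quoted on this slide can be turned into speedup ratios with simple arithmetic. The sketch below only divides the Hive figure by each Impala figure; the dictionary keys and the function name `speedup_over_hive` are illustrative, and the three numbers are copied from the slide.

```python
# Latencies in seconds, copied from the Conclusion slide.
LATENCY_SECONDS = {
    "Impala 0.3 (SequenceFile + Snappy)": 14.337,
    "Impala 0.2 (SequenceFile + Gzip)": 19.733,
    "Hive (RCFile + Snappy)": 164.161,
}


def speedup_over_hive(latencies):
    """Return Hive latency divided by each non-Hive latency."""
    hive = latencies["Hive (RCFile + Snappy)"]
    return {
        name: hive / seconds
        for name, seconds in latencies.items()
        if not name.startswith("Hive")
    }


if __name__ == "__main__":
    for name, ratio in speedup_over_hive(LATENCY_SECONDS).items():
        print(f"{name}: {ratio:.1f}x faster than Hive")
```

With these figures, Impala 0.3 comes out at roughly 11 times faster than Hive and Impala 0.2 at roughly 8 times faster.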