Performance evaluation of Cloudera Impala (with Comparison to Hive)


Yukinori Suda


About Cloudera Impala

•  Latest version is 0.3 beta
•  Open-source implementation inspired by Google Dremel and F1
•  Developed by Cloudera, a well-known Hadoop distributor
•  Brings real-time, ad-hoc query capability to Apache Hadoop
•  Queries data stored in HDFS or Apache HBase
•  Uses the same metadata and SQL syntax (HiveQL) as Apache Hive
•  Supports TextFile and SequenceFile as Hive storage formats
•  Also supports SequenceFile compressed with Snappy, Gzip and Bzip
•  Accesses the data directly through a specialized distributed query engine

Architecture

•  State Store runs as the impala-state-store (statestored) daemon
•  Query Planner, Query Coordinator and Query Exec Engine run inside the impalad daemon

System Environment

•  Installed via Cloudera Manager Free Edition

Master (1 server)
・HDFS: NameNode, SecondaryNameNode
・MapReduceV1: JobTracker
・impala: impalad, impala-state-store (statestored)

Slave (13 servers)
・HDFS: DataNode
・MapReduceV1: TaskTracker
・impala: impalad

All servers are connected by 1 Gbps Ethernet through an L2 switch

Server Specification

•  CPU
o  Intel Core 2 Duo 2.13 GHz with Hyper Threading

•  Memory
o  4GB

•  Disk
o  7,200 rpm SATA mechanical Hard Disk Drive

•  OS
o  CentOS 6.2

Benchmark

•  Used CDH4.1 + Impala versions 0.2 and 0.3
•  Used hivebench from the open-source benchmark tool “HiBench”
o  https://github.com/hibench
•  Reduced the datasets to 1/10 scale
o  The default configuration generates a table with 1 billion rows
•  Modified the query
o  Removed “INSERT INTO TABLE …” to evaluate read-only performance
o  Removed the “datediff” function (I mistakenly believed it was not supported)
•  Combined several Hive storage formats with several compression methods
o  TextFile, SequenceFile, RCFile
o  No compression, Gzip, Snappy
•  Compared query latency
o  Average latency over 5 measurements
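
The measurement procedure above (each storage format paired with each compression method, latency averaged over 5 runs) could be sketched in Python roughly as follows. This is only an illustrative harness, not the tooling used for the slides: `run_query` and `make_runner` are hypothetical callables standing in for however the query is actually submitted to Hive or Impala.

```python
import itertools
import statistics
import time

FORMATS = ["TextFile", "SequenceFile", "RCFile"]
COMPRESSIONS = ["None", "Gzip", "Snappy"]
RUNS = 5  # the slides average over 5 measurements


def average_latency(run_query, runs=RUNS):
    """Call run_query() `runs` times and return the mean wall-clock seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # hypothetical: submits the query and blocks until it finishes
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies)


def benchmark_grid(make_runner):
    """Measure every (format, compression) pair.

    make_runner(fmt, comp) is a hypothetical factory that returns a
    zero-argument callable running the query against the table stored
    in that format/compression combination.
    """
    results = {}
    for fmt, comp in itertools.product(FORMATS, COMPRESSIONS):
        results[(fmt, comp)] = average_latency(make_runner(fmt, comp))
    return results
```

Keeping the query runner behind a callable means the same loop can time Hive and Impala without changing the averaging logic.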

Modified Datasets

•  Uservisits table
o  100 million rows
o  Schema
•  sourceIP string
•  destURL string
•  visitDate string
•  adRevenue double
•  userAgent string
•  countryCode string
•  languageCode string
•  searchWord string
•  duration int

•  Rankings table
o  12 million rows
o  Schema
•  pageURL string
•  pageRank int
•  avgDuration int
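
As a rough illustration, the two schemas above could be turned into HiveQL `CREATE TABLE` statements for each storage format with a small Python helper. This is a sketch under assumptions: the function name and the generated table layout are hypothetical and not taken from HiBench, which sets up its tables with its own scripts; only the column names and types come from the slide.

```python
USERVISITS_COLUMNS = [
    ("sourceIP", "STRING"),
    ("destURL", "STRING"),
    ("visitDate", "STRING"),
    ("adRevenue", "DOUBLE"),
    ("userAgent", "STRING"),
    ("countryCode", "STRING"),
    ("languageCode", "STRING"),
    ("searchWord", "STRING"),
    ("duration", "INT"),
]

RANKINGS_COLUMNS = [
    ("pageURL", "STRING"),
    ("pageRank", "INT"),
    ("avgDuration", "INT"),
]


def create_table_ddl(name, columns, storage_format):
    """Build a HiveQL CREATE TABLE statement (hypothetical helper).

    storage_format is one of "TextFile", "SequenceFile", "RCFile";
    Hive spells these TEXTFILE, SEQUENCEFILE and RCFILE in STORED AS.
    """
    cols = ",\n  ".join(f"{col} {typ}" for col, typ in columns)
    return f"CREATE TABLE {name} (\n  {cols}\n)\nSTORED AS {storage_format.upper()}"
```

For example, `create_table_ddl("rankings", RANKINGS_COLUMNS, "SequenceFile")` yields a statement ending in `STORED AS SEQUENCEFILE`.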

Modified Query

SELECT
  sourceIP,
  sum(adRevenue) as totalRevenue,
  avg(pageRank)
FROM
  rankings R
JOIN (
  SELECT
    sourceIP,
    destURL,
    adRevenue
  FROM
    uservisits UV
  WHERE
    UV.visitDate >= '1999-01-01'
    AND UV.visitDate <= '2001-01-01'
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
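
To make the query's behaviour concrete, here is a minimal sketch that runs it, with its clauses in conventional order, against a few toy rows. It uses Python's built-in SQLite as a stand-in, which is an assumption for illustration only: SQLite is neither Hive nor Impala, the function name is hypothetical, the table is cut down to the four columns the query touches, and the date column is written as `visitDate` to match the schema.

```python
import sqlite3

QUERY = """
SELECT
    sourceIP,
    sum(adRevenue) AS totalRevenue,
    avg(pageRank)
FROM rankings R
JOIN (
    SELECT sourceIP, destURL, adRevenue
    FROM uservisits UV
    WHERE UV.visitDate >= '1999-01-01'
      AND UV.visitDate <= '2001-01-01'
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
"""


def top_revenue_source(rankings, uservisits):
    """Load toy rows into in-memory SQLite and return the single result row.

    rankings:   iterable of (pageURL, pageRank, avgDuration)
    uservisits: iterable of (sourceIP, destURL, visitDate, adRevenue)
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER)"
    )
    conn.execute(
        "CREATE TABLE uservisits "
        "(sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL)"
    )
    conn.executemany("INSERT INTO rankings VALUES (?, ?, ?)", rankings)
    conn.executemany("INSERT INTO uservisits VALUES (?, ?, ?, ?)", uservisits)
    row = conn.execute(QUERY).fetchone()
    conn.close()
    return row
```

The inner SELECT filters visits to the date range, the join attaches each visit to the rank of the page it landed on, and the outer query returns the one source IP with the highest total ad revenue.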

Conclusion

•  Impala is over 10 times faster than MR + Hive
o  Impala 0.3
•  SequenceFile compressed with Snappy: 14.337 seconds
o  Impala 0.2
•  SequenceFile compressed with Gzip: 19.733 seconds
o  Hive
•  RCFile compressed with Snappy: 164.161 seconds

•  Hope that Impala version 1.0, included in CDH5, will be even faster
o  Support for RCFile and the Trevni columnar format
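
The speedup behind the first bullet can be checked directly from the three latencies listed above. A small Python sketch (the dictionary keys and function name are made up for illustration; the numbers are the ones on the slide):

```python
LATENCY_SECONDS = {
    "Impala 0.3 (SequenceFile + Snappy)": 14.337,
    "Impala 0.2 (SequenceFile + Gzip)": 19.733,
    "Hive (RCFile + Snappy)": 164.161,
}


def speedup_over_hive(latencies, baseline="Hive (RCFile + Snappy)"):
    """Return how many times faster each entry is than the baseline entry."""
    base = latencies[baseline]
    return {
        name: base / seconds
        for name, seconds in latencies.items()
        if name != baseline
    }
```

With these figures the ratio is about 11.5 for Impala 0.3 and about 8.3 for Impala 0.2, so the "over 10 times" result holds for version 0.3.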


1. Cloudera Impala Performance Evaluation (with Comparison to Hive), Dec. 8, 2012, CELLANT Corp. R&D Strategy Division, Yukinori SUDA (@sudabon)


9. Benchmark Result (Hive)

10. Benchmark Result (Impala 0.2)

11. Benchmark Result (Impala 0.3)


13. Thank you
