Performance evaluation of cloudera impala (with Comparison to Hive)


Yukinori Suda


About Cloudera impala

•  Latest version is 0.3 beta
•  Open-source implementation inspired by Google Dremel and F1
•  Developed by Cloudera, a well-known Hadoop distributor
•  Brings real-time, ad-hoc query capability to Apache Hadoop
•  Queries data stored in HDFS or Apache HBase
•  Uses the same metadata and SQL syntax (HiveQL) as Apache Hive
•  Supports TextFile and SequenceFile as Hive storage formats
•  Also supports SequenceFile compressed with Snappy, Gzip and Bzip2
•  Accesses the data directly through a specialized distributed query engine

Architecture

•  State Store runs as the impala-state-store (statestored) daemon
•  Query Planner, Query Coordinator and Query Exec Engine run inside the impalad daemon

System Environment

•  Installed via Cloudera Manager Free Edition

Master (1 server)
・HDFS: NameNode, SecondaryNameNode
・MapReduceV1: JobTracker
・impala: impalad, impala-state-store (statestored)

Slave (13 servers)
・HDFS: DataNode
・MapReduceV1: TaskTracker
・impala: impalad

All servers are connected with 1Gbps Ethernet through an L2 switch

Server Specification

•  CPU
o  Intel Core 2 Duo 2.13 GHz with Hyper Threading

•  Memory
o  4GB

•  Disk
o  7,200 rpm SATA mechanical Hard Disk Drive

•  OS
o  CentOS 6.2

Benchmark

•  Used CDH4.1 + impala versions 0.2 and 0.3
•  Used hivebench from the open-source benchmark tool “HiBench”
o  https://github.com/hibench
•  Reduced the datasets to 1/10 scale
o  Default configuration generates a table with 1 billion rows
•  Modified the query
o  Deleted “INSERT INTO TABLE …” to evaluate read-only performance
o  Deleted the “datediff” function (I mistakenly believed it was not supported)
•  Combined several Hive storage formats with several compression methods
o  TextFile, SequenceFile, RCFile
o  No compression, Gzip, Snappy
•  Compared query latency
o  Average job latency over 5 measurements
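The "average over 5 measurements" step can be sketched as a small timing harness. This is only an illustration of the measurement idea, not the script used for the slides: `average_latency` and the stand-in workload are hypothetical names, and in a real run `run_query` would launch the Hive or impala query instead.

```python
import time
from statistics import mean


def average_latency(run_query, repeats=5):
    """Run `run_query` `repeats` times and return (mean seconds, all samples)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()  # hypothetical callable that executes one benchmark query
        samples.append(time.perf_counter() - start)
    return mean(samples), samples


if __name__ == "__main__":
    # Stand-in workload (assumption): replace with a call that runs the real query.
    avg, samples = average_latency(lambda: sum(range(100_000)), repeats=5)
    print(f"average latency over {len(samples)} runs: {avg:.6f} s")
```

Wall-clock time is measured around the whole call, so client start-up cost would be included for both engines in the same way.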

Modified Datasets

•  Uservisits table
o  100 million rows
o  Schema: sourceIP string, destURL string, visitDate string, adRevenue double, userAgent string, countryCode string, languageCode string, searchWord string, duration int

•  Rankings table
o  12 million rows
o  Schema: pageURL string, pageRank int, avgDuration int

Modified Query

SELECT
    sourceIP,
    sum(adRevenue) as totalRevenue,
    avg(pageRank)
FROM
    rankings R
JOIN (
    SELECT
        sourceIP,
        destURL,
        adRevenue
    FROM
        uservisits UV
    WHERE
        UV.visitDate >= '1999-01-01'
        AND UV.visitDate <= '2001-01-01'
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
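To make the query's semantics concrete, here is a minimal sketch that runs the same join and aggregation on a few made-up rows. It uses Python's built-in SQLite rather than Hive or impala, keeps only the columns the query touches, and spells the date column `visitDate` as in the schema; all sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER, avgDuration INTEGER);
    CREATE TABLE uservisits (sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL);
""")

# Toy data (assumption): two ranked pages and five visits.
conn.executemany("INSERT INTO rankings VALUES (?, ?, ?)", [
    ("a.com", 10, 5),
    ("b.com", 20, 7),
])
conn.executemany("INSERT INTO uservisits VALUES (?, ?, ?, ?)", [
    ("1.1.1.1", "a.com", "2000-05-01", 1.5),
    ("1.1.1.1", "b.com", "2000-06-01", 2.5),
    ("2.2.2.2", "a.com", "2000-07-01", 3.0),
    ("2.2.2.2", "b.com", "2005-01-01", 100.0),  # outside the date range
    ("3.3.3.3", "c.com", "2000-01-01", 50.0),   # no matching rankings row
])

QUERY = """
SELECT sourceIP, sum(adRevenue) AS totalRevenue, avg(pageRank)
FROM rankings R
JOIN (
    SELECT sourceIP, destURL, adRevenue
    FROM uservisits UV
    WHERE UV.visitDate >= '1999-01-01' AND UV.visitDate <= '2001-01-01'
) NUV
ON (R.pageURL = NUV.destURL)
GROUP BY sourceIP
ORDER BY totalRevenue DESC
LIMIT 1
"""

# The source IP with the highest total ad revenue among in-range, ranked visits.
top = conn.execute(QUERY).fetchone()
print(top)  # ('1.1.1.1', 4.0, 15.0)
```

The date filter works here because ISO-formatted date strings compare correctly as plain strings, which matches the `visitDate string` column type in the dataset.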

Conclusion

•  Impala is over 10 times faster than MR + Hive (164.161 seconds vs. 14.337 seconds with Impala 0.3)
o  Impala 0.3, SequenceFile compressed with Snappy: 14.337 seconds
o  Impala 0.2, SequenceFile compressed with Gzip: 19.733 seconds
o  Hive, RCFile compressed with Snappy: 164.161 seconds

•  Hopefully impala version 1.0, included in CDH5, will be even faster
o  Support for the RCFile and Trevni columnar formats
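The speed-up behind the "over 10 times" claim can be checked directly from the three latencies quoted above. The short sketch below only redoes that arithmetic; the dictionary keys are labels chosen here, not names from the benchmark.

```python
# Latencies in seconds, as quoted in the conclusion.
latency = {
    "Impala 0.3 (SequenceFile + Snappy)": 14.337,
    "Impala 0.2 (SequenceFile + Gzip)": 19.733,
    "Hive (RCFile + Snappy)": 164.161,
}

hive = latency["Hive (RCFile + Snappy)"]

# Speed-up of each Impala result relative to the Hive result.
speedup = {
    name: hive / seconds
    for name, seconds in latency.items()
    if name.startswith("Impala")
}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x faster than Hive")
```

With these figures the ratio is roughly 11.5x for Impala 0.3 and roughly 8.3x for Impala 0.2, so the "over 10 times" statement applies to the 0.3 result.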


Cloudera impala Performance Evaluation (with Comparison to Hive)
Dec. 8, 2012
CELLANT Corp. R&D Strategy Division
Yukinori SUDA (@sudabon)


[Chart: Benchmark Result (Hive)]

[Chart: Benchmark Result (impala 0.2)]

[Chart: Benchmark Result (impala 0.3)]


Thank you
