SlideShare uma empresa Scribd logo
1 de 28
Baixar para ler offline
The state of in the cloud
Nicolas Poggi
Oct 2017
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_
/_/
Outline
1. Intro
1. PaaS Cloud
2. BigBench
2. Part I – Scalability
1. Hive vs. Spark
3. Part II – Additional experiments
1. Versions, Concurrency, 10TB
4. Summary
2
Motivation
• 2016 SQL-on-Hadoop paper and presentations
• Focused on Hive, due to SparkSQL not being ready to use in PaaS
• Used benchmark (TPC-H)
• Early 2017, BigBench testing Hive and Spark and TPC-TC paper
• New code available in May for MLlib2 compatibility
• Goals:
Evaluate the current out-of-the-box experience of Spark v2 in PaaS cloud
• Using Hive as baseline
3
Platform-as-a-Service Spark
• Simplified management
• Cloud-based managed Hadoop services
• Ready to use Spark, Hive, …
• Deploys in minutes, on-demand, elastic
• Pay-as-you-go pricing model
• Decoupled compute and storage
• Optimized for general purpose
• Fined tuned to the cloud provider architecture
4
Surveyed PaaS services
• Amazon Elastic Map Reduce (EMR)
• Released: Apr 2009
• OS: Amazon Linux AMI (RHEL-like)
• Spark 2.1.0 and Hive 2.1 (Tez)
• VM: M4.2xlarge (32GB RAM)
• Google Cloud DataProc (GCD)
• Released: Feb 2016
• OS: Debian 8.4
• Spark 2.1.0 (preview), Hive 2.1 (M/R)
• VM: n1-standard-8 (30GB RAM)
• Azure HDInsight (HDI)
• Released: Oct 2013
• OS: Ubuntu 16.04 (HDP-based)
• Spark 2.1.0 and 1.6.3, Hive 1.2 (Tez)
• VM: D4v2 (28GB RAM)
• Target deployment 128-cores:
• 16 data nodes with 8-cores each
• Master node with 16-cores
• Decoupled storage only
• EBS, WASB, GCS
5
What is BigBench (TPCx-BB)
• End-to-end application level benchmark specification
• result of many years of collaboration of industry and academia
• Covers most Big Data Analytical properties (3Vs)
• 30 business use cases for a retailer company
• Merchandizing,
• pricing,
• customers …
• Defines data scale factors
• 1GB to PBs
6
Retailer database
Sequential Hive vs Spark 2.1
Queries 1-30 on Spark 2.1 (power runs)
Query 1 Query 2 …. Query 30
Welcome to
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_ version 2.1.0
/_/
BB 1GB-1TB Scalability Dataproc:
Hive 2.1 (M/R) vs Spark 2.1
BB 1GB-1TB Scalability EMR:
Hive 2.1 (Tez) vs Spark 2.1
BB 1GB-1TB Scalability HDI:
Hive 1.2 (Tez) vs Spark 2.1
BB 1GB-1TB Scalability: Hive vs Spark 2.1
All providers
BB 1TB Power runs : Hive vs Spark 2.1 (ALL)
CPU % Q5 (ML) in Hive and Spark (HDI)
13
• Hive (MLlib2) • Spark (MLlib2)
Time (s) Time (s) - 2X faster
Radar charts – query characterization
• Useful for displaying multivariate
data (5 resources)
• Quickly identify similarities and
differences.
• From example
• Hive and Spark
• Only Disk Write is similar
• Hive consumes more MEM and CPU
• Spark read more from disk (DISK_R)
• And moderately more network
Sample radar chart for Q7 in EMR at 1TB 14
BB 1TB Query 5 (ML) providers comparison
Hive (MLlib2) Spark (MLlib2)
15
Other comparisons:
10TB SQL-Only
2.0.2 vs 2.1.0
1.6.3 vs 2.1.0
MLlib v1 vs v2
16
BB 1GB-10TB Scalability SQL-only queries
Hive
Spark
BigBench 1GB-1TB: Spark 2.0.2 vs 2.1.0 (CDP)
Notes:
Spark 2.1 a bit faster at
small scales, slower at
100 GB and 1 TB on the
UDF/NLP queries
2.1 faster up
to 100GB
Slower at 1TB
BigBench 1GB-1TB: Spark 1.6.3 vs 2.1.0
MLlib 1 vs 2.1 MLlib 2(HDI)
Notes:
• Spark 2.1 is always
faster than 1.6.3 in
HDI
• MLlilb 2 using
dataframes over RDDs
is only slightly faster
than V1.
Concurrency runs (throughput)
SQL-only: 100GB – 1TB 512-core cluster
2020
BB Throughput at 100GB 8 streams SQL-only
(512-cores)
BB Throughput at 1TB 4 streams SQL-only
(512-cores)
Conclusions
• All providers have up to date (2.1.0) and well tuned versions of Spark
• They could run BigBench up to 1TB on medium-sized cluster,
• [Almost] Out-of-the box
• Performance similar among providers for similar cluster types and disk configs
• Difference according to scale (and pricing)
• Spark 2.1.0 is faster than previous versions
• Also MLlib 2 with dataframes
• But improvements within the 30% range
• Hive (+Tez + MLlib) are still slightly faster than Spark at lower scales for sequential
• But Spark significantly faster at high data scales and concurrency
• BigBench has been useful to stress a cluster with different workloads
• Highlights config problems fast and stresses scale limits
• Helpful for tuning the clusters
23
Resources and references
BigBench and ALOJA
• BigBench Spark 2 branch (thanks Christoph
and Michael from bankmark.de):
• https://github.com/carabolic/Big-Data-
Benchmark-for-Big-Bench/tree/spark2
• Original BigBench Implementation
repository
• https://github.com/intel-hadoop/Big-Data-
Benchmark-for-Big-Bench
• ALOJA benchmarking platform
• https://github.com/Aloja/aloja
• ALOJA fork of BigBench (adds support for
HDI and fixes spark)
• https://github.com/Aloja/Big-Data-Benchmark-
for-Big-Bench
Papers and slides
• https://www.slideshare.net/ni_po
• Characterizing TPCx-BB Queries,
Hive, and Spark in Multi-Cloud
Environments – N. Poggi et. Al
• TPC-TC 2017
• The State of SQL-on-Hadoop in the
Cloud – N. Poggi et. al.
• IEEE Big Data 2016
• https://doi.org/10.1109/BigData.2016
.7840751
24
Thanks, questions?
Follow up / feedback : Npoggi@ac.upc.edu
Twitter: ni_po
The state of in the cloud
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/___/ .__/_,_/_/ /_/_
/_/
Extra slides
26
BB 1TB Query 2 (M/R) providers comparison
Hive Spark
27
Spark config
EMR CDP HDI
Java version OpenJDK 1.8.0_121 OpenJDK 1.8.0_121 OpenJDK 1.8.0_131
Spark version 2.1.0 2.1 2.1.0.2.6.0.2-76
Driver memory 5G 5G 5G
Executor memory 5G 10G 4G
Executor cores 4 4 3
Executor instances Dynamic Dynamic 20
dynamicAllocation
enabled
TRUE TRUE FALSE
Executor
memoryOverhead
Default (384MB) 1,117 MB Default (384MB)
28

Mais conteúdo relacionado

Mais procurados

Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesMesosphere Inc.
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)DataWorks Summit
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Data Con LA
 
Innovation with Connection, The new HPCC Systems Plugins and Modules
Innovation with Connection, The new HPCC Systems Plugins and ModulesInnovation with Connection, The new HPCC Systems Plugins and Modules
Innovation with Connection, The new HPCC Systems Plugins and ModulesHPCC Systems
 
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxData
 
Singer, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging InfrastructureSinger, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging InfrastructureDiscover Pinterest
 
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020confluent
 
Espresso Database Replication with Kafka, Tom Quiggle
Espresso Database Replication with Kafka, Tom QuiggleEspresso Database Replication with Kafka, Tom Quiggle
Espresso Database Replication with Kafka, Tom Quiggleconfluent
 
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...StreamNative
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestKrishna Gade
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka StreamsGuozhang Wang
 
Hive on Spark, production experience @Uber
 Hive on Spark, production experience @Uber Hive on Spark, production experience @Uber
Hive on Spark, production experience @UberFuture of Data Meetup
 
Clickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek VavrusaClickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek VavrusaValery Tkachenko
 
Change Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHChange Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHParis Data Engineers !
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Redis Labs
 
Spark, spark streaming & tachyon
Spark, spark streaming & tachyonSpark, spark streaming & tachyon
Spark, spark streaming & tachyonJohan hong
 

Mais procurados (20)

Episode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data ServicesEpisode 3: Kubernetes and Big Data Services
Episode 3: Kubernetes and Big Data Services
 
Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)Kafka to the Maxka - (Kafka Performance Tuning)
Kafka to the Maxka - (Kafka Performance Tuning)
 
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
Big Data Day LA 2015 - The Big Data Journey: How Big Data Practices Evolve at...
 
Innovation with Connection, The new HPCC Systems Plugins and Modules
Innovation with Connection, The new HPCC Systems Plugins and ModulesInnovation with Connection, The new HPCC Systems Plugins and Modules
Innovation with Connection, The new HPCC Systems Plugins and Modules
 
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
InfluxEnterprise Architectural Patterns by Dean Sheehan, Senior Director, Pre...
 
Singer, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging InfrastructureSinger, Pinterest's Logging Infrastructure
Singer, Pinterest's Logging Infrastructure
 
Data Pipeline with Kafka
Data Pipeline with KafkaData Pipeline with Kafka
Data Pipeline with Kafka
 
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
Welcome to Kafka; We’re Glad You’re Here (Dave Klein, Centene) Kafka Summit 2020
 
How do you decide where your customer was?
How do you decide where your customer was?How do you decide where your customer was?
How do you decide where your customer was?
 
Espresso Database Replication with Kafka, Tom Quiggle
Espresso Database Replication with Kafka, Tom QuiggleEspresso Database Replication with Kafka, Tom Quiggle
Espresso Database Replication with Kafka, Tom Quiggle
 
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...The Evolution of Trillion-level Real-time Messaging System in BIGO  - Puslar ...
The Evolution of Trillion-level Real-time Messaging System in BIGO - Puslar ...
 
Scalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at PinterestScalable and Reliable Logging at Pinterest
Scalable and Reliable Logging at Pinterest
 
Migrating pipelines into Docker
Migrating pipelines into DockerMigrating pipelines into Docker
Migrating pipelines into Docker
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Hive on Spark, production experience @Uber
 Hive on Spark, production experience @Uber Hive on Spark, production experience @Uber
Hive on Spark, production experience @Uber
 
Clickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek VavrusaClickhouse at Cloudflare. By Marek Vavrusa
Clickhouse at Cloudflare. By Marek Vavrusa
 
Change Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVHChange Data Capture with Data Collector @OVH
Change Data Capture with Data Collector @OVH
 
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
Scaling Redis Cluster Deployments for Genome Analysis (featuring LSU) - Terry...
 
Spark, spark streaming & tachyon
Spark, spark streaming & tachyonSpark, spark streaming & tachyon
Spark, spark streaming & tachyon
 
Machine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFiMachine Learning in the IoT with Apache NiFi
Machine Learning in the IoT with Apache NiFi
 

Semelhante a State of cloud in the PaaS

Ceph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph Community
 
How SQL Server 2016 SP1 Changes the Game
How SQL Server 2016 SP1 Changes the GameHow SQL Server 2016 SP1 Changes the Game
How SQL Server 2016 SP1 Changes the GamePARIKSHIT SAVJANI
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4Michael Kehoe
 
Scalable data pipeline at Traveloka - Facebook Dev Bandung
Scalable data pipeline at Traveloka - Facebook Dev BandungScalable data pipeline at Traveloka - Facebook Dev Bandung
Scalable data pipeline at Traveloka - Facebook Dev BandungRendy Bambang Junior
 
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...Databricks
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2datamantra
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processingconfluent
 
(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol Performance(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol PerformanceBIOVIA
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
 
Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Petr Novotný
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathChester Chen
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...viirya
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Landon Robinson
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesSeungYong Oh
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsDatabricks
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Spark Summit
 

Semelhante a State of cloud in the PaaS (20)

Ceph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der SterCeph for Big Science - Dan van der Ster
Ceph for Big Science - Dan van der Ster
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
How SQL Server 2016 SP1 Changes the Game
How SQL Server 2016 SP1 Changes the GameHow SQL Server 2016 SP1 Changes the Game
How SQL Server 2016 SP1 Changes the Game
 
Big Data training
Big Data trainingBig Data training
Big Data training
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4CouchbasetoHadoop_Matt_Michael_Justin v4
CouchbasetoHadoop_Matt_Michael_Justin v4
 
Scalable data pipeline at Traveloka - Facebook Dev Bandung
Scalable data pipeline at Traveloka - Facebook Dev BandungScalable data pipeline at Traveloka - Facebook Dev Bandung
Scalable data pipeline at Traveloka - Facebook Dev Bandung
 
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
BigDL: Bringing Ease of Use of Deep Learning for Apache Spark with Jason Dai ...
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
 
(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol Performance(ATS3-PLAT08) Optimizing Protocol Performance
(ATS3-PLAT08) Optimizing Protocol Performance
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Available platforms for Big Data 2.0
Available platforms for Big Data 2.0
 
Index conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreathIndex conf sparkml-feb20-n-pentreath
Index conf sparkml-feb20-n-pentreath
 
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with P...
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with KubernetesKubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
Kubernetes Forum Seoul 2019: Re-architecting Data Platform with Kubernetes
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new Directions
 
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
Unified Framework for Real Time, Near Real Time and Offline Analysis of Video...
 

Mais de Nicolas Poggi

Benchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA ConstraintsBenchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA ConstraintsNicolas Poggi
 
Correctness and Performance of Apache Spark SQL
Correctness and Performance of Apache Spark SQLCorrectness and Performance of Apache Spark SQL
Correctness and Performance of Apache Spark SQLNicolas Poggi
 
Using BigBench to compare Hive and Spark (short version)
Using BigBench to compare Hive and Spark (short version)Using BigBench to compare Hive and Spark (short version)
Using BigBench to compare Hive and Spark (short version)Nicolas Poggi
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheNicolas Poggi
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudNicolas Poggi
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataNicolas Poggi
 
Vagrant + Docker provider [+Puppet]
Vagrant + Docker provider [+Puppet]Vagrant + Docker provider [+Puppet]
Vagrant + Docker provider [+Puppet]Nicolas Poggi
 
The case for Hadoop performance
The case for Hadoop performanceThe case for Hadoop performance
The case for Hadoop performanceNicolas Poggi
 

Mais de Nicolas Poggi (9)

Benchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA ConstraintsBenchmarking Elastic Cloud Big Data Services under SLA Constraints
Benchmarking Elastic Cloud Big Data Services under SLA Constraints
 
Correctness and Performance of Apache Spark SQL
Correctness and Performance of Apache Spark SQLCorrectness and Performance of Apache Spark SQL
Correctness and Performance of Apache Spark SQL
 
Using BigBench to compare Hive and Spark (short version)
Using BigBench to compare Hive and Spark (short version)Using BigBench to compare Hive and Spark (short version)
Using BigBench to compare Hive and Spark (short version)
 
Accelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket CacheAccelerating HBase with NVMe and Bucket Cache
Accelerating HBase with NVMe and Bucket Cache
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
 
Vagrant + Docker provider [+Puppet]
Vagrant + Docker provider [+Puppet]Vagrant + Docker provider [+Puppet]
Vagrant + Docker provider [+Puppet]
 
The case for Hadoop performance
The case for Hadoop performanceThe case for Hadoop performance
The case for Hadoop performance
 

Último

BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Delhi Call girls
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 

Último (20)

BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 

State of cloud in the PaaS

  • 1. The state of in the cloud Nicolas Poggi Oct 2017 ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ /_/
  • 2. Outline 1. Intro 1. PaaS Cloud 2. BigBench 2. Part I – Scalability 1. Hive vs. Spark 3. Part II – Additional experiments 1. Versions, Concurrency, 10TB 4. Summary 2
  • 3. Motivation • 2016 SQL-on-Hadoop paper and presentations • Focused on Hive, due to SparkSQL not being ready to use in PaaS • Used benchmark (TPC-H) • Early 2017, BigBench testing Hive and Spark and TPC-TC paper • New code available in May for MLlib2 compatibility • Goals: Evaluate the current out-of-the-box experience of Spark v2 in PaaS cloud • Using Hive as baseline 3
  • 4. Platform-as-a-Service Spark • Simplified management • Cloud-based managed Hadoop services • Ready to use Spark, Hive, … • Deploys in minutes, on-demand, elastic • Pay-as-you-go pricing model • Decoupled compute and storage • Optimized for general purpose • Fined tuned to the cloud provider architecture 4
  • 5. Surveyed PaaS services • Amazon Elastic Map Reduce (EMR) • Released: Apr 2009 • OS: Amazon Linux AMI (RHEL-like) • Spark 2.1.0 and Hive 2.1 (Tez) • VM: M4.2xlarge (32GB RAM) • Google Cloud DataProc (GCD) • Released: Feb 2016 • OS: Debian 8.4 • Spark 2.1.0 (preview), Hive 2.1 (M/R) • VM: n1-standard-8 (30GB RAM) • Azure HDInsight (HDI) • Released: Oct 2013 • OS: Ubuntu 16.04 (HDP-based) • Spark 2.1.0 and 1.6.3, Hive 1.2 (Tez) • VM: D4v2 (28GB RAM) • Target deployment 128-cores: • 16 data nodes with 8-cores each • Master node with 16-cores • Decoupled storage only • EBS, WASB, GCS 5
  • 6. What is BigBench (TPCx-BB) • End-to-end application level benchmark specification • result of many years of collaboration of industry and academia • Covers most Big Data Analytical properties (3Vs) • 30 business use cases for a retailer company • Merchandizing, • pricing, • customers … • Defines data scale factors • 1GB to PBs 6 Retailer database
  • 7. Sequential Hive vs Spark 2.1 Queries 1-30 on Spark 2.1 (power runs) Query 1 Query 2 …. Query 30 Welcome to ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ version 2.1.0 /_/
  • 8. BB 1GB-1TB Scalability Dataproc: Hive 2.1 (M/R) vs Spark 2.1
  • 9. BB 1GB-1TB Scalability EMR: Hive 2.1 (Tez) vs Spark 2.1
  • 10. BB 1GB-1TB Scalability HDI: Hive 1.2 (Tez) vs Spark 2.1
  • 11. BB 1GB-1TB Scalability: Hive vs Spark 2.1 All providers
  • 12. BB 1TB Power runs : Hive vs Spark 2.1 (ALL)
  • 13. CPU % Q5 (ML) in Hive and Spark (HDI) 13 • Hive (MLlib2) • Spark (MLlib2) Time (s) Time (s) - 2X faster
  • 14. Radar charts – query characterization • Useful for displaying multivariate data (5 resources) • Quickly identify similarities and differences. • From example • Hive and Spark • Only Disk Write is similar • Hive consumes more MEM and CPU • Spark read more from disk (DISK_R) • And moderately more network Sample radar chart for Q7 in EMR at 1TB 14
  • 15. BB 1TB Query 5 (ML) providers comparison Hive (MLlib2) Spark (MLlib2) 15
  • 16. Other comparisons: 10TB SQL-Only 2.0.2 vs 2.1.0 1.6.3 vs 2.1.0 MLlib v1 vs v2 16
  • 17. BB 1GB-10TB Scalability SQL-only queries Hive Spark
  • 18. BigBench 1GB-1TB: Spark 2.0.2 vs 2.1.0 (CDP) Notes: Spark 2.1 a bit faster at small scales, slower at 100 GB and 1 TB on the UDF/NLP queries 2.1 faster up to 100GB Slower at 1TB
  • 19. BigBench 1GB-1TB: Spark 1.6.3 vs 2.1.0 MLlib 1 vs 2.1 MLlib 2(HDI) Notes: • Spark 2.1 is always faster than 1.6.3 in HDI • MLlilb 2 using dataframes over RDDs is only slightly faster than V1.
  • 20. Concurrency runs (throughput) SQL-only: 100GB – 1TB 512-core cluster 2020
  • 21. BB Throughput at 100GB 8 streams SQL-only (512-cores)
  • 22. BB Throughput at 1TB 4 streams SQL-only (512-cores)
  • 23. Conclusions • All providers have up to date (2.1.0) and well tuned versions of Spark • They could run BigBench up to 1TB on medium-sized cluster, • [Almost] Out-of-the box • Performance similar among providers for similar cluster types and disk configs • Difference according to scale (and pricing) • Spark 2.1.0 is faster than previous versions • Also MLlib 2 with dataframes • But improvements within the 30% range • Hive (+Tez + MLlib) are still slightly faster than Spark at lower scales for sequential • But Spark significantly faster at high data scales and concurrency • BigBench has been useful to stress a cluster with different workloads • Highlights config problems fast and stresses scale limits • Helpful for tuning the clusters 23
  • 24. Resources and references BigBench and ALOJA • BigBench Spark 2 branch (thanks Christoph and Michael from bankmark.de): • https://github.com/carabolic/Big-Data- Benchmark-for-Big-Bench/tree/spark2 • Original BigBench Implementation repository • https://github.com/intel-hadoop/Big-Data- Benchmark-for-Big-Bench • ALOJA benchmarking platform • https://github.com/Aloja/aloja • ALOJA fork of BigBench (adds support for HDI and fixes spark) • https://github.com/Aloja/Big-Data-Benchmark- for-Big-Bench Papers and slides • https://www.slideshare.net/ni_po • Characterizing TPCx-BB Queries, Hive, and Spark in Multi-Cloud Environments – N. Poggi et. Al • TPC-TC 2017 • The State of SQL-on-Hadoop in the Cloud – N. Poggi et. al. • IEEE Big Data 2016 • https://doi.org/10.1109/BigData.2016 .7840751 24
  • 25. Thanks, questions? Follow up / feedback : Npoggi@ac.upc.edu Twitter: ni_po The state of in the cloud ____ __ / __/__ ___ _____/ /__ _ / _ / _ `/ __/ '_/ /___/ .__/_,_/_/ /_/_ /_/
  • 27. BB 1TB Query 2 (M/R) providers comparison Hive Spark 27
  • 28. Spark config EMR CDP HDI Java version OpenJDK 1.8.0_121 OpenJDK 1.8.0_121 OpenJDK 1.8.0_131 Spark version 2.1.0 2.1 2.1.0.2.6.0.2-76 Driver memory 5G 5G 5G Executor memory 5G 10G 4G Executor cores 4 4 3 Executor instances Dynamic Dynamic 20 dynamicAllocation enabled TRUE TRUE FALSE Executor memoryOverhead Default (384MB) 1,117 MB Default (384MB) 28