1
How Data Volume Affects Spark Based Data Analytics on a
Scale-up Server
Ahsan Javed Awan
EMJD-DC (KTH-UPC)
(https://www.kth.se/profile/ajawan/)
Mats Brorsson(KTH), Vladimir Vlassov(KTH) and Eduard
Ayguade(UPC and BSC),
2
Motivation
Why should we care about architecture support?
*Source: SGI
Data Growing Faster Than Technology
4
Motivation
Cont...
Our Focus
Improve the node level performance
through architecture support
*Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/
Phoenix ++,
Metis, Ostrich,
etc..
Hadoop, Spark,
Flink, etc..
5
Motivation
Cont...
● A mismatch between the characteristics of emerging workloads and the underlying
hardware.
– M. Ferdman et al., “Clearing the clouds: A study of emerging scale-out workloads
on modern hardware,” in ASPLOS 2012.
– Z. Jia et al., “Characterizing data analysis workloads in data centers,” in IISWC
2013.
– Z. Jia et al., “Characterizing and subsetting big data workloads,” in IISWC 2014.
– A. Yasin et al., “Deep-dive analysis of the data analytics workload in CloudSuite,” in
IISWC 2014.
– T. Jiang et al., “Understanding the behavior of in-memory computing workloads,” in
IISWC 2014.
Existing studies lack a quantitative analysis of the bottlenecks of
scale-out frameworks on a single node.
6
Progress Meeting 12-12-14
Which Scale-out Framework ?
[Picture Courtesy: Amir H. Payberah]
7
Our Approach
● “Performance characterization of in-memory data analytics on a
modern cloud server,” in 5th International IEEE Conference on
Big Data and Cloud Computing, 2015 (Best Paper Award).
● How Data Volume Affects Spark Based Data Analytics on a
Scale-up Server
What are the major bottlenecks?
Focus of this talk
8
Our Approach
● Do Spark based data analytics benefit from using scale-up
servers?
● How severe is the impact of garbage collection on performance
of Spark based data analytics?
● Is file I/O detrimental to Spark based data analytics
performance?
● How does data size affect the micro-architecture performance
of Spark based data analytics?
What are the remaining questions?
9
Our Approach
● We evaluate the impact of data volume on the performance of
Spark based data analytics running on a scale-up server.
● We quantify the limitations of using Spark on a scale-up server
with large volumes of data.
● We quantify the variations in micro-architectural performance of
applications across different data volumes.
What are the contributions?
10
Our Approach
● Use a subset of benchmarks from BigDataBench
● Use Big Data Generator Suite (BDGS), to generate synthetic
datasets of 6 GB, 12 GB and 24 GB.
● Configure Spark in local mode and tune its internal parameters.
● Rely on GC logs to collect garbage collection times.
● Use Spark logs to gather execution time of benchmarks.
● Use Concurrency Analysis in Intel VTune to collect wait time and CPU
time of executor pool threads.
● Use General Micro-architectural Exploration in Intel VTune to analyze
the impact of data volume on micro-architecture characteristics.
Methodology
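As a rough illustration of the GC-log step above, the sketch below sums the pause times found in a HotSpot GC log. It assumes the JDK 7/8 style output produced by -XX:+PrintGCDetails, where each collection ends with a "real=... secs" field. The function name total_gc_seconds and the sample log lines are hypothetical and are not taken from the talk.

```python
import re

# Matches the wall-clock pause reported at the end of each GC event,
# e.g. "[Times: user=0.10 sys=0.00, real=0.05 secs]" (assumed log format).
REAL_PAUSE = re.compile(r"real=(\d+(?:\.\d+)?) secs")


def total_gc_seconds(log_lines):
    """Sum the wall-clock pause of every GC event found in the log lines."""
    total = 0.0
    for line in log_lines:
        for match in REAL_PAUSE.finditer(line):
            total += float(match.group(1))
    return total


# Hypothetical sample lines, for illustration only.
SAMPLE_LOG = [
    "[GC (Allocation Failure) [PSYoungGen: 65536K->10720K(76288K)] "
    "65536K->10728K(251392K), 0.0123 secs] "
    "[Times: user=0.03 sys=0.01, real=0.01 secs]",
    "[Full GC (Ergonomics) [PSYoungGen: 10720K->0K(76288K)] "
    "[ParOldGen: 8K->10500K(175104K)] 10728K->10500K(251392K), "
    "[Metaspace: 3000K->3000K(1056768K)], 0.0456 secs] "
    "[Times: user=0.10 sys=0.00, real=0.05 secs]",
    "Heap",
]

if __name__ == "__main__":
    print(f"total GC pause: {total_gc_seconds(SAMPLE_LOG):.2f} s")
```

Dividing this total by the execution time taken from the Spark logs would give the proportion of time spent in GC for a run.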
11
Our Approach
What are the characteristics of benchmarks?
12
Our Hardware Configuration
System Details
13
Our Hardware Configuration
Machine Details
Hyper Threading and Turbo-boost are disabled
14
Our Approach
Software Parameters
15
Motivation
Do Spark based data analytics benefit from using larger
scale-up servers?
Spark applications do not benefit significantly from executors with more than 12 cores
16
Motivation
Is GC detrimental to scalability of Spark applications?
The proportion of GC time increases with the number of cores
17
Motivation
Does performance remain consistent as we enlarge the data
size?
Decrease in data processed per second (DPS) ranges from 11% to 93% (Parallel Scavenge)
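The metric on this slide can be sketched as follows, assuming that data processed per second (DPS) is the input size divided by the benchmark's execution time. That definition, the helper names and all numbers below are illustrative assumptions, not measurements from the talk.

```python
def dps(input_gb, exec_seconds):
    """Data processed per second in GB/s (assumed definition: size / time)."""
    if exec_seconds <= 0:
        raise ValueError("execution time must be positive")
    return input_gb / exec_seconds


def percent_decrease(baseline, value):
    """Relative drop of `value` with respect to `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


# Hypothetical execution times (seconds) for one benchmark at each data size.
RUNS = {6: 60.0, 12: 150.0, 24: 600.0}

if __name__ == "__main__":
    base = dps(6, RUNS[6])
    for size_gb, seconds in RUNS.items():
        rate = dps(size_gb, seconds)
        print(f"{size_gb:>2} GB: {rate:.3f} GB/s, "
              f"{percent_decrease(base, rate):.0f}% below the 6 GB run")
```

A flat DPS across data sizes would mean execution time grows in proportion to the input; a falling DPS means it grows faster.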
18
Motivation
Does the choice of garbage collector impact the data
processing capability of the system?
Improvement in DPS ranges from 1.4x to 3.7x on average
with Parallel Scavenge as compared to G1
19
Motivation
How does GC affect the data processing capability of
the system?
GC time does not scale linearly with data size.
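One simple way to check this kind of claim is to compare how fast GC time grows with how fast the data grows. The sketch below does that for the three dataset sizes used in the study; the GC times are made-up placeholders and the helper names are hypothetical.

```python
def growth_factors(values):
    """Growth of each value relative to the first one."""
    first = values[0]
    return [v / first for v in values]


def grows_faster_than_data(sizes, gc_times):
    """True if GC time outgrows the data at every size after the first."""
    data_growth = growth_factors(sizes)
    gc_growth = growth_factors(gc_times)
    return all(g > d for d, g in zip(data_growth[1:], gc_growth[1:]))


SIZES_GB = [6, 12, 24]            # dataset sizes used in the study
GC_SECONDS = [5.0, 20.0, 160.0]   # hypothetical GC times, not measured values

if __name__ == "__main__":
    for size, d, g in zip(SIZES_GB,
                          growth_factors(SIZES_GB),
                          growth_factors(GC_SECONDS)):
        print(f"{size:>2} GB: data x{d:.0f}, GC time x{g:.0f}")
    print("GC grows faster than data:",
          grows_faster_than_data(SIZES_GB, GC_SECONDS))
```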
20
Motivation
How does CPU utilization scale with data volume?
CPU utilization decreases as the input data size increases
21
Motivation
Is file I/O detrimental to performance?
Fraction of file I/O increases by 6x, 18x and 25x for Word Count,
Naive Bayes and Sort respectively when input data is increased by 4x
22
Motivation
How does data size affect micro-architectural
performance?
5 to 10% better instruction retirement as we enlarge the data size
23
Motivation
Cont..
Execution units inside the core exhibit improved utilization at larger data sets
24
Motivation
Cont..
Increase in L1 Bound Stalls implies better utilization of L1 Caches
25
Motivation
Cont..
Spark benchmarks exhibit reduced memory bandwidth utilization
26
Key Findings
● Spark workloads do not benefit significantly from executors with
more than 12 cores.
● The performance of Spark workloads degrades with large volumes
of data due to substantial increase in garbage collection and file
I/O time.
● Without any tuning, the Parallel Scavenge garbage collection scheme
outperforms Concurrent Mark Sweep and G1 garbage collectors
for Spark workloads.
● Spark workloads exhibit improved instruction retirement due to
lower L1 cache misses and better utilization of functional units
inside cores at large volumes of data.
● Memory bandwidth utilization of Spark benchmarks decreases
with large volumes of data and is 3x lower than the available off-
chip bandwidth on our test machine
27
Motivation
Future Directions
NUMA Aware Task Scheduling
Cache Aware Transformations
Exploiting Processing In Memory Architectures
HW/SW Data Prefetching
Rethinking Memory Architectures

原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excel
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
INTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processingINTRODUCTION TO Natural language processing
INTRODUCTION TO Natural language processing
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
SWOT Analysis Slides Powerpoint Template.pptx
SWOT Analysis Slides Powerpoint Template.pptxSWOT Analysis Slides Powerpoint Template.pptx
SWOT Analysis Slides Powerpoint Template.pptx
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一
办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一
办理(UC毕业证书)堪培拉大学毕业证成绩单原版一比一
 

How Data Volume Affects Spark Based Data Analytics on a Scale-up Server

  • 1. How Data Volume Affects Spark Based Data Analytics on a Scale-up Server. Ahsan Javed Awan, EMJD-DC (KTH-UPC) (https://www.kth.se/profile/ajawan/); Mats Brorsson (KTH), Vladimir Vlassov (KTH) and Eduard Ayguade (UPC and BSC).
  • 2. Motivation: Why should we care about architecture support? Data is growing faster than technology (source: SGI).
  • 3. Motivation (cont.): Our focus is to improve node-level performance through architecture support. Phoenix++, Metis, Ostrich, etc.; Hadoop, Spark, Flink, etc. (source: http://navcode.info/2012/12/24/cloud-scaling-schemes/)
  • 4. Motivation (cont.): There is a mismatch between the characteristics of emerging workloads and the underlying hardware.
    – M. Ferdman et al., “Clearing the clouds: A study of emerging scale-out workloads on modern hardware,” in ASPLOS 2012.
    – Z. Jia et al., “Characterizing data analysis workloads in data centers,” in IISWC 2013.
    – Z. Jia et al., “Characterizing and subsetting big data workloads,” in IISWC 2014.
    – A. Yasin et al., “Deep-dive analysis of the data analytics workload in cloudsuite,” in IISWC 2014.
    – T. Jiang et al., “Understanding the behavior of in-memory computing workloads,” in IISWC 2014.
    Existing studies lack a quantitative analysis of the bottlenecks of scale-out frameworks on a single node.
  • 5. Progress Meeting 12-12-14. Which scale-out framework? [Picture courtesy: Amir H. Payberah]
  • 6. Our Approach: What are the major bottlenecks?
    ● “Performance characterization of in-memory data analytics on a modern cloud server,” in 5th International IEEE Conference on Big Data and Cloud Computing, 2015 (Best Paper Award).
    ● “How Data Volume Affects Spark Based Data Analytics on a Scale-up Server” (focus of this talk).
  • 7. Our Approach: What are the remaining questions?
    ● Do Spark based data analytics benefit from using scale-up servers?
    ● How severe is the impact of garbage collection on the performance of Spark based data analytics?
    ● Is file I/O detrimental to Spark based data analytics performance?
    ● How does data size affect the micro-architecture performance of Spark based data analytics?
  • 8. Our Approach: What are the contributions?
    ● We evaluate the impact of data volume on the performance of Spark based data analytics running on a scale-up server.
    ● We quantify the limitations of using Spark on a scale-up server with large volumes of data.
    ● We quantify the variations in micro-architectural performance of applications across different data volumes.
  • 9. Our Approach: Methodology
    ● Use a subset of benchmarks from BigDataBench.
    ● Use the Big Data Generator Suite (BDGS) to generate synthetic datasets of 6 GB, 12 GB and 24 GB.
    ● Configure Spark in local mode and tune its internal parameters.
    ● Rely on GC logs to collect garbage collection times.
    ● Use Spark logs to gather the execution time of benchmarks.
    ● Use Concurrency Analysis in Intel VTune to collect the wait time and CPU time of executor pool threads.
    ● Use General Micro-architectural Exploration in Intel VTune to analyze the impact of data volume on micro-architecture characteristics.
  • 10. Our Approach: What are the characteristics of the benchmarks?
  • 12. Our Hardware Configuration: Machine details. Hyper-Threading and Turbo Boost are disabled.
  • 14. Motivation: Do Spark based data analytics benefit from using larger scale-up servers? Spark applications do not benefit significantly from executors with more than 12 cores.
  • 15. Motivation: Is GC detrimental to the scalability of Spark applications? The proportion of GC time increases with the number of cores.
  • 16. Motivation: Does performance remain consistent as we enlarge the data size? The decrease in data processed per second (DPS) ranges from 11% to 93% (Parallel Scavenge).
  • 17. Motivation: Does the choice of garbage collector impact the data processing capability of the system? The improvement in DPS ranges from 1.4x to 3.7x on average with Parallel Scavenge as compared to G1.
  • 18. Motivation: How does GC affect the data processing capability of the system? GC time does not scale linearly with data size.
  • 19. Motivation: How does CPU utilization scale with data volume? CPU utilization decreases as the input data size increases.
  • 20. Motivation: Is file I/O detrimental to performance? The fraction of file I/O increases by 6x, 18x and 25x for Word Count, Naive Bayes and Sort respectively when the input data is increased by 4x.
  • 21. Motivation: How does data size affect micro-architectural performance? Instruction retirement is 5 to 10% better as we enlarge the data size.
  • 22. Motivation (cont.): Execution units inside the core exhibit improved utilization at larger data sets.
  • 23. Motivation (cont.): An increase in L1 Bound stalls implies better utilization of L1 caches.
  • 24. Motivation (cont.): Spark benchmarks exhibit reduced memory bandwidth utilization.
  • 25. Key Findings
    ● Spark workloads do not benefit significantly from executors with more than 12 cores.
    ● The performance of Spark workloads degrades with large volumes of data due to a substantial increase in garbage collection and file I/O time.
    ● Without any tuning, the Parallel Scavenge garbage collection scheme outperforms the Concurrent Mark Sweep and G1 garbage collectors for Spark workloads.
    ● Spark workloads exhibit improved instruction retirement due to lower L1 cache misses and better utilization of functional units inside cores at large volumes of data.
    ● Memory bandwidth utilization of Spark benchmarks decreases with large volumes of data and is 3x lower than the available off-chip bandwidth on our test machine.
  • 26. Motivation: Future Directions
    ● NUMA Aware Task Scheduling
    ● Cache Aware Transformations
    ● Exploiting Processing In Memory Architectures
    ● HW/SW Data Prefetching
    ● Rethinking Memory Architectures
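
The measurement setup described in the methodology and the DPS comparisons in the results can be sketched roughly as follows. This is only an illustration, not the authors' scripts: the jar name, class name, heap size and timing numbers are hypothetical, DPS is assumed to mean input size divided by execution time, and the GC logging flags assume a JDK 8 HotSpot JVM. In local mode the executor threads run inside the driver JVM, which is why the collector is selected through the driver's Java options.

```python
from typing import List

# Standard HotSpot flags (JDK 8) that select each collector compared in the talk.
GC_FLAGS = {
    "parallel_scavenge": "-XX:+UseParallelGC",
    "cms": "-XX:+UseConcMarkSweepGC",
    "g1": "-XX:+UseG1GC",
}


def spark_submit_command(app_jar: str, main_class: str, cores: int,
                         collector: str, heap: str = "8g") -> List[str]:
    """Build a spark-submit command for local mode with GC logging enabled.

    app_jar, main_class and heap are hypothetical placeholders.
    """
    jvm_opts = " ".join([
        GC_FLAGS[collector],
        "-XX:+PrintGCDetails",      # JDK 8 style GC logging
        "-XX:+PrintGCTimeStamps",
        "-Xloggc:gc.log",
    ])
    return [
        "spark-submit",
        "--master", f"local[{cores}]",       # local mode with a fixed thread count
        "--driver-memory", heap,
        "--driver-java-options", jvm_opts,   # executor threads share the driver JVM
        "--class", main_class,
        app_jar,
    ]


def data_processed_per_second(input_gb: float, exec_seconds: float) -> float:
    """DPS, assumed here to be input size divided by execution time (GB/s)."""
    if exec_seconds <= 0:
        raise ValueError("execution time must be positive")
    return input_gb / exec_seconds


def percent_decrease(baseline: float, value: float) -> float:
    """Relative drop of value with respect to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


if __name__ == "__main__":
    # Hypothetical run: 12 cores, Parallel Scavenge.
    print(" ".join(spark_submit_command("bench.jar", "Sort", 12, "parallel_scavenge")))
    # Hypothetical timings for the smallest and largest dataset sizes.
    small = data_processed_per_second(6, 60)
    large = data_processed_per_second(24, 480)
    print(f"DPS decrease: {percent_decrease(small, large):.1f}%")
```

Comparing DPS rather than raw execution time makes runs with different input sizes directly comparable: if execution time grew in proportion to the data, DPS would stay flat, so any drop reflects the extra GC and file I/O overhead reported in the slides.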