Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads

1
Micro-architectural Characterization of
Apache Spark on Batch and Stream
Processing Workloads
Ahsan Javed Awan
EMJD-DC (KTH-UPC)
(https://www.kth.se/profile/ajawan/)
Mats Brorsson (KTH), Eduard Ayguade (UPC and BSC),
Vladimir Vlassov (KTH)

2
Motivation
Why should we care about architecture support?
*Taken from Babak's slides
Data Growing Faster Than Technology

3
Motivation (cont.)
Our Goal
Improve the node level performance
through architecture support
*Source: http://navcode.info/2012/12/24/cloud-scaling-schemes/
Phoenix++, Metis, Ostrich, etc.
Hadoop, Spark, Flink, etc.

4
Our Approach
What are the major performance bottlenecks?
● Performance characterization of in-memory data analytics on a
modern cloud server, in 5th International IEEE Conference on Big
Data and Cloud Computing, 2015 (Best Paper Award).
● How Data Volume Affects Spark Based Data Analytics on a
Scale-up Server, in 6th International Workshop on Big Data
Benchmarks, Performance Optimization and Emerging Hardware
(BPOE), held in conjunction with VLDB 2015, Hawaii, USA.
– Limited to batch processing workloads only
– Does not consider the velocity aspect of big data
– Experiments are based on an older version of Spark

5
Our Approach
What are the remaining questions?
● Does micro-architectural performance remain consistent
across batch and stream processing workloads?
● How do DataFrames compare to RDDs micro-architecturally?
● How does data velocity affect micro-architectural performance?

6
Which Scale-out Framework?
[Picture Courtesy: Amir H. Payberah]
● Tuning of Spark internal parameters
● Tuning of JVM parameters (heap size, etc.)
● Micro-architecture level analysis using hardware performance
counters
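The later slides report results in terms of front-end bound stalls, back-end bound stalls and instruction retirement. A minimal sketch of how such a breakdown can be derived from hardware performance counters is given here, assuming the level-1 top-down formulas published for Intel cores of this generation. The slides do not state the exact formulas or tooling used, so the event names in the comments, the pipeline width of 4, and the sample numbers are all assumptions for illustration.

```python
from dataclasses import dataclass

# Assumption: 4 issue slots per cycle, as commonly used for Ivy Bridge-class cores.
PIPELINE_WIDTH = 4


@dataclass
class CounterSample:
    """Raw counter values for one measurement interval (supplied by the caller)."""
    cycles: int              # e.g. CPU_CLK_UNHALTED.THREAD
    uops_not_delivered: int  # e.g. IDQ_UOPS_NOT_DELIVERED.CORE
    uops_issued: int         # e.g. UOPS_ISSUED.ANY
    uops_retired_slots: int  # e.g. UOPS_RETIRED.RETIRE_SLOTS
    recovery_cycles: int     # e.g. INT_MISC.RECOVERY_CYCLES


def top_down_level1(s: CounterSample) -> dict:
    """Split pipeline slots into the four level-1 top-down categories."""
    slots = PIPELINE_WIDTH * s.cycles
    if slots <= 0:
        raise ValueError("cycle count must be positive")
    frontend = s.uops_not_delivered / slots
    bad_speculation = (
        s.uops_issued - s.uops_retired_slots + PIPELINE_WIDTH * s.recovery_cycles
    ) / slots
    retiring = s.uops_retired_slots / slots
    # Whatever is not accounted for above is attributed to the back end.
    backend = 1.0 - (frontend + bad_speculation + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_speculation,
        "retiring": retiring,
        "backend_bound": backend,
    }


# Illustrative numbers only, not measurements from the study.
sample = CounterSample(
    cycles=1000,
    uops_not_delivered=800,
    uops_issued=2200,
    uops_retired_slots=2000,
    recovery_cycles=50,
)
breakdown = top_down_level1(sample)
for name, fraction in breakdown.items():
    print(f"{name}: {fraction:.2f}")
```

The four fractions sum to 1 by construction, which makes it easy to compare workloads by where their pipeline slots go.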

7
Our Approach
Which benchmarks?

8
Our Hardware Configuration
Which Machine?
Hyper-Threading and Turbo Boost are disabled
Intel's Ivy Bridge Server

9
Does micro-architectural performance remain
consistent?
Stream processing is micro-architecturally similar to batch processing in Spark

10
Cont.
Stream processing is micro-architecturally similar to batch processing in Spark

11
Cont.
Streaming workloads with similar Spark transformations have different
micro-architectural behavior

12
Cont.

13
Cont.

14
Cont.
Workload | Spark transformations                     | Input data rate | Window size (s) | Working set with 2 s sampling interval
WWc      | FlatMap, Map, ReduceByKeyAndWindow        | 10^4            | 30              | 15 x 10^4
CSpc     | FlatMap, Map, CountByValueAndWindow       | 10^4            | 10              | 5 x 10^4
CErpz    | FlatMap, Map, Window, GroupByKey          | 10^4            | 30              | 15 x 10^4
CAuC     | FlatMap, Map, Window, GroupByKey, Count   | 10^4            | 10              | 5 x 10^4
Tpt      | FlatMap, ReduceByKeyAndWindow, Transform  | 10^1            | 60              | 30 x 10^1
Micro-batch size determines the micro-architectural behavior of stream processing
workloads with similar Spark transformations
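The working-set column in the table above is consistent with input data rate x window size / sampling interval. That relation is inferred from the numbers in the table and is not stated on the slide, so treat the helper below (a hypothetical name, `working_set`) as a sketch that merely reproduces the tabulated values.

```python
def working_set(input_rate: int, window_s: int, sampling_interval_s: int = 2) -> int:
    """Working set implied by the table: rate * window / sampling interval."""
    return input_rate * window_s // sampling_interval_s


# (workload, input data rate, window size in seconds, working set from the table)
rows = [
    ("WWc", 10**4, 30, 15 * 10**4),
    ("CSpc", 10**4, 10, 5 * 10**4),
    ("CErpz", 10**4, 30, 15 * 10**4),
    ("CAuC", 10**4, 10, 5 * 10**4),
    ("Tpt", 10**1, 60, 30 * 10**1),
]

for name, rate, window, expected in rows:
    computed = working_set(rate, window)
    status = "matches" if computed == expected else "differs from"
    print(f"{name}: computed {computed} {status} table value {expected}")
```

Under this reading, WWc and CErpz share one working-set size and CSpc and CAuC share another, even though the transformations within each pair differ.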

15
Do DataFrames perform better than RDDs at the
micro-architectural level?
DataFrames exhibit:
● 25% fewer back-end bound stalls
● 64% fewer DRAM-bound stalled cycles
● 25% less BW consumption
● 10% less starvation of execution resources
DataFrames have better micro-architectural performance than RDDs

16
How does data velocity affect micro-architectural
performance?
Better CPU utilization at higher data velocity

17
Cont.
● Higher instruction retirement at higher data velocity
● Higher L1-bound stalls at higher data velocity
● Less starvation at higher data velocity
● Higher BW consumption at higher data velocity

18
Conclusion
● Batch processing and stream processing have the same micro-architectural
behavior in Spark if the only difference between the two implementations is
micro-batching.
● Spark workloads using DataFrames have improved instruction
retirement over workloads using RDDs.
● If the input data rates are small, stream processing workloads are
front-end bound. However, the front-end bound stalls are reduced at
larger input data rates and instruction retirement is improved.
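To make the first conclusion concrete, the toy model below shows what "the only difference is micro-batching" means: the same per-record logic runs either once over the whole input or repeatedly over small slices of it. This is plain Python, not the Spark API, and the function names and sample input are hypothetical.

```python
from collections import Counter
from itertools import islice


def word_count(lines):
    """Per-record logic shared by the batch and the streaming variant."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def micro_batches(stream, batch_size):
    """Cut an iterable into consecutive lists of at most batch_size records."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def streaming_word_count(stream, batch_size):
    """Run the batch logic on each micro-batch and merge the partial results."""
    total = Counter()
    for batch in micro_batches(stream, batch_size):
        total += word_count(batch)
    return total


# Illustrative input only.
lines = ["spark batch stream", "stream spark", "batch", "spark spark stream"]
batch_result = word_count(lines)
stream_result = streaming_word_count(lines, batch_size=2)
print(batch_result == stream_result)
```

Because both variants execute the same inner loop, any difference between them comes from how much data each invocation sees, which is the micro-batch size discussed on the earlier slide.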

20
List of Papers
● Performance characterization of in-memory data analytics on a
modern cloud server, in 5th International IEEE Conference on Big Data
and Cloud Computing, 2015 (Best Paper Award).
● How Data Volume Affects Spark Based Data Analytics on a Scale-up
Server, in 6th International Workshop on Big Data Benchmarks,
Performance Optimization and Emerging Hardware (BPOE), held in
conjunction with VLDB 2015, Hawaii, USA.
● Micro-architectural Characterization of Apache Spark on Batch and
Stream Processing Workloads (accepted to BDCloud 2016).
● Node Architecture Implications for In-Memory Data Analytics in
Scale-in Clusters (accepted to IEEE BDCAT 2016).
● Implications of In-Memory Data Analytics with Apache Spark on Near
Data Computing Architectures (under submission).
