SlideShare uma empresa Scribd logo
1 de 12
Baixar para ler offline
Platforms
Marat Zhanikeev
maratishe@gmail.com
maratishe.github.io
Hadoop versus Bigdata Replay
Tokyo Univ. of Science
O n P e r f o r m a n c e U n d e r H o t s p o t s i n
WebDB Forum 2017@お茶の水女子大
PDF → bit.do/170920
Background on Hadoop
• Hadoop performance measurement
◦ creators on performance limits 09
◦ superlinear effect 08
◦ various benchmarks on Hadoop vs Spark 07
◦ inconsistencies in measurements 11
• Hadoop/MapReduce optimization in 14 and a ton of other papers
• the ”Do We (actually) Need Hadoop?” argument in 10 and few recent
papers
09 K.Shvachko+0 ”HDFS scalability: the limits to growth” Usenix Login (2010)
08 N.Gunther+2 ”Hadoop Superlinear Scalability” ACM Queue (2015)
07 J.Shi+6 ”Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics” Very Large Data Bases (2015)
11 M.Xia+3 ”Performance Inconsistency in Large Scale Data Processing Clusters” 10th USENIX ICAC (2013)
14 A.Rasooli+1 ”COSHH: A Classiffication and Optimization based Scheduler for Heterogeneous Hadoop Systems” Future Gen.Comp.Sys. (2014)
10 A.Rowstron+1 ”Nobody ever got fired for using Hadoop on a cluster” 1st HotCDP (2012)
M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 2/12
2/12
Modeling Hadoop Bottlenecks
Network
(NW)
Bulk
Storage
(BS)
Shared
Memory
(SM)
Core Output
Big Data Processing
HPC, Simulators, Modeling
Small
Data
Bulk
Storage
(BS)
On-Chip
Shared
Memory
(hSM)
Numberofparallelaccesses
Network
(NW)
Ability to isolate
Bottleneck
(pipe width)
RAM-based
Shared Memory
(sSM) Bulk
Storage
(BS)
Network
(NW)1
RAM-based
Shared Memory
(sSM)
Parallelaccesses
Ability to isolate
Core Output
Small
Data
M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 3/12
3/12
Hadoop’s Answer: Rack Awareness
Rack
Switch
Datanode
Datanode
Datanode
…
Rack
Switch
Datanode
Datanode
……
Core
Switch
Client
Client
Logical
Client
Own Rack
Switch
Other Rack Switch
Other Rack Switch
Other Rack Switch
Datanodes
• official Hadoop feature
(not a bug) 12
• some dynamics, goes
off-rack when local
nodes have too many jobs
• sadly, manual
configuration of rack
affiliation (much potential here for
research on virtual network coordinates –
Meridian, Vivaldi...)
12 ”Hadoop: Rack Awareness” https://hadoop.apache.org (2017)
M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 4/12
4/12
Hadoop vs Bigdata Replay Method
• basic idea similar to 10 but uses circuits 02 to transfer shards and multicore
01 to parallel-process them
Name Node
Storage Node (shard)
file A
file B
file C
…
Hadoop Space
Manager
Hadoop Job
(your code)
Hadoop Job
(your code)
Hadoop Job
(your code)
MapReduce
job (your code)
manymany
Name
Server(s)
Client Machine
Hadoop Client
Your
Code
You
Start Use
Deploy
FindRead/parse
many
Internals (DC)
Users
Storage Node
(shard)
Time-Aware
Sub-Store(s)
Manager
Client Machine
Client
Your
Sketcher
You
Start Use
Schedule
Multicore
Replay
Replay Node
many
10 A.Rowstron+1 ”Nobody ever got fired for using Hadoop on a cluster” 1st HotCDP (2012)
02 myself+0 ”Circuit Emulation for Big Data Transfers in Clouds” Networking for Big Data, CRC (2015)
01 myself+0 ”Streaming Algorithms for Big Data Processing on Multicore” Big Data: Algorithms, Analytics, and Applications, CRC (2015)
M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 5/12
5/12
Replay Environment is Highly Flexible!
• replay is time-aligned, so jobs can pick any spot on the timeline
• similar to Spark in going beyond key-value datatype but more – the full scope
of streaming algorithms 01
• massively multicore environments 04 with 100+ cores, dynamic re-packing of
job batches, etc.
Core 1
Core 1
Core X
Replay
Manager
Now(replay)
….
Time-Aligned Big Data
Cursor
Time
Direction
One Sketch One SketchOne Sketch
Start End End End
Read/prepare
Shared Memory
Start
….
Time
Now
(buffer head)
Manager
Job
Job
Buffer
tail
pos
pos
Controller
Kill
2 Report
Manage
in realtime
One Replay Batch
One
Buffer
One
Buffer
One
BufferJobs
Jobs
Jobs
Replay at
a scale
1
01 myself+0 ”Streaming Algorithms for Big Data Processing on Multicore” Big Data: Algorithms, Analytics, and Applications, CRC (2015)
04 myself+0 ”Volume and Irregularity Effects on Massively Multicore Packet Processors” APNOMS (2016)
M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 6/12
6/12
Performance under hotspots
M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 7/12
7/12
The Hotspot Distribution
0 20 40 60 80 100
Decreasing order
0
0.35
0.7
1.05
1.4
1.75
2.1
2.45
2.8
log(value)
Class A Class B Class C Class D Class E
• models Flash/Hotspot/
Killerapp/Blackswan
events using extreme variance
in popularity
• generation method:
stick-breaking process,
Dirichlet distribution with
parallel beta sources 05
• final step: classify based on
the number of hot/flash items
05 myself+1 ”Popularity-Based Modeling of Flash Events in Synthetic Packet Traces” CQ研 (2012)
M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 8/12
8/12
The Binary ”Till Contention” Metric
• not a common, but very realistic
way to model performance under load
• note: even more applicable under
hotspot-y input
Rack
Rack
Border
(switch)
Client
Data
Shards
Data
Shards
…
Volume
Contention
Contention -free
to contention -ful
threshold
• example: function of server response
time to load can be expressed as:
T =
1
2
[
(L − n) +
√
(L − n)2 + k
1 − L
]
• ...where T is response time, L is load,
and k is the knee = contention point!
M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 9/12
9/12
Performance Models
• shard size as S and in-job traffic to shard size ratio r
◦ so, Hadoop jobs generate rS versus always strictly S under Replay
• contention threshold as C (for both contention and/or capacity)
• list of shard hotness (popularity)
{
h1, h2, h3, ..., hn
}
and sizes{
S1, S2, S3, ...., Sn
}
• then we have (job/traffic) volume for Hadoop:
Vhadoop =
∑
i=1..n
rhiSi
• ... and for Replay method:
Vreplay =
∑
i=1..n
Si (1)
M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 10/12
10/12
Results
A 0.001 B 0.001 C 0.001 D 0.001 E 0.001
hadoopreplay
A 0.005 B 0.005 C 0.005 D 0.005 E 0.005
A 0.01 B 0.01 C 0.01 D 0.01 E 0.01
A 0.05 B 0.05 C 0.05 D 0.05 E 0.05
A 0.1 B 0.1 C 0.1 D 0.1 E 0.1
A 0.2 B 0.2 C 0.2 D 0.2 E 0.2
10
20
50
100
200
500
1000
2000
5000
10000
0.7
1.4
2.1
2.8
3.5
4.2
log(1+timetillcontention)
A 0.5 B 0.5 C 0.5 D 0.5 E 0.5
Replay period (step) is 10
A 0.001 B 0.001 C 0.001 D 0.001 E 0.001
hadoopreplay
A 0.005 B 0.005 C 0.005 D 0.005 E 0.005
A 0.01 B 0.01 C 0.01 D 0.01 E 0.01
A 0.05 B 0.05 C 0.05 D 0.05 E 0.05
A 0.1 B 0.1 C 0.1 D 0.1 E 0.1
A 0.2 B 0.2 C 0.2 D 0.2 E 0.2
10
20
50
100
200
500
1000
2000
5000
10000
0.8
1.6
2.4
3.2
4
4.8
log(1+timetillcontention)
A 0.5 B 0.5 C 0.5 D 0.5 E 0.5
Replay period (step) is 50
A 0.001 B 0.001 C 0.001 D 0.001 E 0.001
hadoopreplay
A 0.005 B 0.005 C 0.005 D 0.005 E 0.005
A 0.01 B 0.01 C 0.01 D 0.01 E 0.01
A 0.05 B 0.05 C 0.05 D 0.05 E 0.05
A 0.1 B 0.1 C 0.1 D 0.1 E 0.1
A 0.2 B 0.2 C 0.2 D 0.2 E 0.2
10
20
50
100
200
500
1000
2000
5000
10000
0.9
1.8
2.7
3.6
4.5
5.4
log(1+timetillcontention)
A 0.5 B 0.5 C 0.5 D 0.5 E 0.5
Replay period (step) is 200
M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 11/12
11/12
That’s all, thank you ...
M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 12/12
12/12

Mais conteúdo relacionado

Mais procurados

Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
Mark Kromer
 
Bigdata " new level"
Bigdata " new level"Bigdata " new level"
Bigdata " new level"
Vamshikrishna Goud
 

Mais procurados (20)

Big Data in the Real World
Big Data in the Real WorldBig Data in the Real World
Big Data in the Real World
 
Bio bigdata
Bio bigdata Bio bigdata
Bio bigdata
 
My other computer is a datacentre - 2012 edition
My other computer is a datacentre - 2012 editionMy other computer is a datacentre - 2012 edition
My other computer is a datacentre - 2012 edition
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!Counting Unique Users in Real-Time: Here's a Challenge for You!
Counting Unique Users in Real-Time: Here's a Challenge for You!
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP Introduction to Bigdata and HADOOP
Introduction to Bigdata and HADOOP
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
BigData HUB Workshop
BigData HUB WorkshopBigData HUB Workshop
BigData HUB Workshop
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Data lake ppt
Data lake pptData lake ppt
Data lake ppt
 
Big Data: An Overview
Big Data: An OverviewBig Data: An Overview
Big Data: An Overview
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Bigdata " new level"
Bigdata " new level"Bigdata " new level"
Bigdata " new level"
 
Big Data simplified
Big Data simplifiedBig Data simplified
Big Data simplified
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Bigdata
Bigdata Bigdata
Bigdata
 

Semelhante a On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms

SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases  SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases
DataWorks Summit
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Cloudera, Inc.
 
Yaroslav Ravlinko “Evolution of Data Processing platform from Hadoop to nowad...
Yaroslav Ravlinko “Evolution of Data Processing platform from Hadoop to nowad...Yaroslav Ravlinko “Evolution of Data Processing platform from Hadoop to nowad...
Yaroslav Ravlinko “Evolution of Data Processing platform from Hadoop to nowad...
Lviv Startup Club
 
Hadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant ScaleHadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant Scale
DataWorks Summit
 
Hadoop 2 @Twitter, Elephant Scale. Presented at
Hadoop 2 @Twitter, Elephant Scale. Presented at Hadoop 2 @Twitter, Elephant Scale. Presented at
Hadoop 2 @Twitter, Elephant Scale. Presented at
lohitvijayarenu
 

Semelhante a On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms (20)

Big Data & Hadoop. Simone Leo (CRS4)
Big Data & Hadoop. Simone Leo (CRS4)Big Data & Hadoop. Simone Leo (CRS4)
Big Data & Hadoop. Simone Leo (CRS4)
 
EMC Isilon Database Converged deck
EMC Isilon Database Converged deckEMC Isilon Database Converged deck
EMC Isilon Database Converged deck
 
Big data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guideBig data hadooop analytic and data warehouse comparison guide
Big data hadooop analytic and data warehouse comparison guide
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Big data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guideBig data Hadoop Analytic and Data warehouse comparison guide
Big data Hadoop Analytic and Data warehouse comparison guide
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
 
Webinar : Talend : The Non-Programmer's Swiss Knife for Big Data
Webinar  : Talend : The Non-Programmer's Swiss Knife for Big DataWebinar  : Talend : The Non-Programmer's Swiss Knife for Big Data
Webinar : Talend : The Non-Programmer's Swiss Knife for Big Data
 
SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases  SQL on Hadoop: Defining the New Generation of Analytics Databases
SQL on Hadoop: Defining the New Generation of Analytics Databases
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
 
Replayable BigData for Multicore Processing and Statistically Rigid Sketching
Replayable BigData for Multicore Processing and Statistically Rigid SketchingReplayable BigData for Multicore Processing and Statistically Rigid Sketching
Replayable BigData for Multicore Processing and Statistically Rigid Sketching
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013
 
Talend For Big Data : Secret Key to Hadoop
Talend For Big Data  : Secret Key to HadoopTalend For Big Data  : Secret Key to Hadoop
Talend For Big Data : Secret Key to Hadoop
 
ETL using Big Data Talend
ETL using Big Data Talend  ETL using Big Data Talend
ETL using Big Data Talend
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Yaroslav Ravlinko “Evolution of Data Processing platform from Hadoop to nowad...
Yaroslav Ravlinko “Evolution of Data Processing platform from Hadoop to nowad...Yaroslav Ravlinko “Evolution of Data Processing platform from Hadoop to nowad...
Yaroslav Ravlinko “Evolution of Data Processing platform from Hadoop to nowad...
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
Hadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant ScaleHadoop 2 @ Twitter, Elephant Scale
Hadoop 2 @ Twitter, Elephant Scale
 
Hadoop 2 @Twitter, Elephant Scale. Presented at
Hadoop 2 @Twitter, Elephant Scale. Presented at Hadoop 2 @Twitter, Elephant Scale. Presented at
Hadoop 2 @Twitter, Elephant Scale. Presented at
 
PDE2011 pythonOCC project status and plans
PDE2011 pythonOCC project status and plansPDE2011 pythonOCC project status and plans
PDE2011 pythonOCC project status and plans
 
Python and trending_data_ops
Python and trending_data_opsPython and trending_data_ops
Python and trending_data_ops
 

Mais de Tokyo University of Science

Mais de Tokyo University of Science (20)

A Method for Cloud-Assisted Secure Wireless Grouping of Client Devices at Net...
A Method for Cloud-Assisted Secure Wireless Grouping of Client Devices at Net...A Method for Cloud-Assisted Secure Wireless Grouping of Client Devices at Net...
A Method for Cloud-Assisted Secure Wireless Grouping of Client Devices at Net...
 
Ultrasound Relative Positioning for IoT Devices in Dense Wireless Spaces
Ultrasound Relative Positioning for IoT Devices in Dense Wireless SpacesUltrasound Relative Positioning for IoT Devices in Dense Wireless Spaces
Ultrasound Relative Positioning for IoT Devices in Dense Wireless Spaces
 
Towards a Packet Traffic Genome Project as a Method for Realtime Sub-Flow Tra...
Towards a Packet Traffic Genome Project as a Method for Realtime Sub-Flow Tra...Towards a Packet Traffic Genome Project as a Method for Realtime Sub-Flow Tra...
Towards a Packet Traffic Genome Project as a Method for Realtime Sub-Flow Tra...
 
What if We Atomize Student Data and Apps and Put Them on Docker Containers?
What if We Atomize Student Data and Apps and Put Them on Docker Containers?What if We Atomize Student Data and Apps and Put Them on Docker Containers?
What if We Atomize Student Data and Apps and Put Them on Docker Containers?
 
Large-Scale Crowdsourcing by Vehicular Data Packets in a Sparse Roadside Infr...
Large-Scale Crowdsourcing by Vehicular Data Packets in a Sparse Roadside Infr...Large-Scale Crowdsourcing by Vehicular Data Packets in a Sparse Roadside Infr...
Large-Scale Crowdsourcing by Vehicular Data Packets in a Sparse Roadside Infr...
 
Taking the Step from Software to Product Development \\ when teaching PBL at ...
Taking the Step from Software to Product Development \\ when teaching PBL at ...Taking the Step from Software to Product Development \\ when teaching PBL at ...
Taking the Step from Software to Product Development \\ when teaching PBL at ...
 
Design and Implementation of a 3-Party Cloud-Backed Handshake for Secure Grou...
Design and Implementation of a 3-Party Cloud-Backed Handshake for Secure Grou...Design and Implementation of a 3-Party Cloud-Backed Handshake for Secure Grou...
Design and Implementation of a 3-Party Cloud-Backed Handshake for Secure Grou...
 
The Switchboard Optimization Problem and Heuristics for Cut-Through Networking
The Switchboard Optimization Problem and Heuristics for Cut-Through NetworkingThe Switchboard Optimization Problem and Heuristics for Cut-Through Networking
The Switchboard Optimization Problem and Heuristics for Cut-Through Networking
 
The Switchboard Traffic Engineering Problem for Mixed Contention/Cut-Through ...
The Switchboard Traffic Engineering Problem for Mixed Contention/Cut-Through ...The Switchboard Traffic Engineering Problem for Mixed Contention/Cut-Through ...
The Switchboard Traffic Engineering Problem for Mixed Contention/Cut-Through ...
 
Bulk-n-Pick Method for One-to-Many Data Transfer in Dense Wireless Spaces
Bulk-n-Pick Method for One-to-Many Data Transfer in Dense Wireless SpacesBulk-n-Pick Method for One-to-Many Data Transfer in Dense Wireless Spaces
Bulk-n-Pick Method for One-to-Many Data Transfer in Dense Wireless Spaces
 
Fog Cloud Caching at Network Edge via Local Hardware Awareness Spaces
Fog Cloud Caching at Network Edge via Local Hardware Awareness SpacesFog Cloud Caching at Network Edge via Local Hardware Awareness Spaces
Fog Cloud Caching at Network Edge via Local Hardware Awareness Spaces
 
On a Hybrid Packets-and-Circuits Switching Logic
On a Hybrid Packets-and-Circuits Switching LogicOn a Hybrid Packets-and-Circuits Switching Logic
On a Hybrid Packets-and-Circuits Switching Logic
 
Image-Related Uses for Roadside Infrastructure \\ based on Wireless Beacons
Image-Related Uses for Roadside Infrastructure \\ based on Wireless BeaconsImage-Related Uses for Roadside Infrastructure \\ based on Wireless Beacons
Image-Related Uses for Roadside Infrastructure \\ based on Wireless Beacons
 
Complexity Resolution Control for Context Based on Metromaps
Complexity Resolution Control for Context Based on MetromapsComplexity Resolution Control for Context Based on Metromaps
Complexity Resolution Control for Context Based on Metromaps
 
The Declarative-Coordinated Model for Self-Optimization of Service Networks
The Declarative-Coordinated Model for Self-Optimization of Service NetworksThe Declarative-Coordinated Model for Self-Optimization of Service Networks
The Declarative-Coordinated Model for Self-Optimization of Service Networks
 
3-Way Scripts as a Practical Platform for Secure Distributed Code in Clouds
3-Way Scripts as a Practical Platform for Secure Distributed Code in Clouds3-Way Scripts as a Practical Platform for Secure Distributed Code in Clouds
3-Way Scripts as a Practical Platform for Secure Distributed Code in Clouds
 
3-Way Scripts as a Base Unit for Flexible Scale-Out Code
3-Way Scripts as a Base Unit for Flexible Scale-Out Code3-Way Scripts as a Base Unit for Flexible Scale-Out Code
3-Way Scripts as a Base Unit for Flexible Scale-Out Code
 
Towards Social Robotics on Smartphones with Simple XYZV Sensor Feedback
Towards Social Robotics on Smartphones with Simple XYZV Sensor FeedbackTowards Social Robotics on Smartphones with Simple XYZV Sensor Feedback
Towards Social Robotics on Smartphones with Simple XYZV Sensor Feedback
 
Back to Rings but not Tokens: Physical and Logical Designs for Distributed Fi...
Back to Rings but not Tokens: Physical and Logical Designs for Distributed Fi...Back to Rings but not Tokens: Physical and Logical Designs for Distributed Fi...
Back to Rings but not Tokens: Physical and Logical Designs for Distributed Fi...
 
Browser Visualization using PNGs Generated by HTML5 Workers on Multicore
Browser Visualization using PNGs Generated by HTML5 Workers on MulticoreBrowser Visualization using PNGs Generated by HTML5 Workers on Multicore
Browser Visualization using PNGs Generated by HTML5 Workers on Multicore
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Último (20)

Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms

  • 1. Platforms Marat Zhanikeev maratishe@gmail.com maratishe.github.io Hadoop versus Bigdata Replay Tokyo Univ. of Science O n P e r f o r m a n c e U n d e r H o t s p o t s i n WebDB Forum 2017@お茶の水女子大 PDF → bit.do/170920
  • 2. Background on Hadoop • Hadoop performance measurement ◦ creators on performance limits 09 ◦ superlinear effect 08 ◦ various benchmarks on Hadoop vs Spark 07 ◦ inconsistencies in measurements 11 • Hadoop/MapReduce optimization in 14 and a ton of other papers • the ”Do We (actually) Need Hadoop?” argument in 10 and few recent papers 09 K.Shvachko+0 ”HDFS scalability: the limits to growth” Usenix Login (2010) 08 N.Gunther+2 ”Hadoop Superlinear Scalability” ACM Queue (2015) 07 J.Shi+6 ”Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics” Very Large Data Bases (2015) 11 M.Xia+3 ”Performance Inconsistency in Large Scale Data Processing Clusters” 10th USENIX ICAC (2013) 14 A.Rasooli+1 ”COSHH: A Classiffication and Optimization based Scheduler for Heterogeneous Hadoop Systems” Future Gen.Comp.Sys. (2014) 10 A.Rowstron+1 ”Nobody ever got fired for using Hadoop on a cluster” 1st HotCDP (2012) M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 2/12 2/12
  • 3. Modeling Hadoop Bottlenecks Network (NW) Bulk Storage (BS) Shared Memory (SM) Core Output Big Data Processing HPC, Simulators, Modeling Small Data Bulk Storage (BS) On-Chip Shared Memory (hSM) Numberofparallelaccesses Network (NW) Ability to isolate Bottleneck (pipe width) RAM-based Shared Memory (sSM) Bulk Storage (BS) Network (NW)1 RAM-based Shared Memory (sSM) Parallelaccesses Ability to isolate Core Output Small Data M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 3/12 3/12
  • 4. Hadoop’s Answer: Rack Awareness Rack Switch Datanode Datanode Datanode … Rack Switch Datanode Datanode …… Core Switch Client Client Logical Client Own Rack Switch Other Rack Switch Other Rack Switch Other Rack Switch Datanodes • official Hadoop feature (not a bug) 12 • some dynamics, goes off-rack when local nodes have too many jobs • sadly, manual configuration of rack affiliation (much potential here for research on virtual network coordinates – Meridian, Vivaldi...) 12 ”Hadoop: Rack Awareness” https://hadoop.apache.org (2017) M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 4/12 4/12
  • 5. Hadoop vs Bigdata Replay Method • basic idea similar to 10 but uses circuits 02 to transfer shards and multicore 01 to parallel-process them Name Node Storage Node (shard) file A file B file C … Hadoop Space Manager Hadoop Job (your code) Hadoop Job (your code) Hadoop Job (your code) MapReduce job (your code) manymany Name Server(s) Client Machine Hadoop Client Your Code You Start Use Deploy FindRead/parse many Internals (DC) Users Storage Node (shard) Time-Aware Sub-Store(s) Manager Client Machine Client Your Sketcher You Start Use Schedule Multicore Replay Replay Node many 10 A.Rowstron+1 ”Nobody ever got fired for using Hadoop on a cluster” 1st HotCDP (2012) 02 myself+0 ”Circuit Emulation for Big Data Transfers in Clouds” Networking for Big Data, CRC (2015) 01 myself+0 ”Streaming Algorithms for Big Data Processing on Multicore” Big Data: Algorithms, Analytics, and Applications, CRC (2015) M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 5/12 5/12
  • 6. Replay Environment is Highly Flexible! • replay is time-aligned, so jobs can pick any spot on the timeline • similar to Spark in going beyond key-value datatype but more – the full scope of streaming algorithms 01 • massively multicore environments 04 with 100+ cores, dynamic re-packing of job batches, etc. Core 1 Core 1 Core X Replay Manager Now(replay) …. Time-Aligned Big Data Cursor Time Direction One Sketch One SketchOne Sketch Start End End End Read/prepare Shared Memory Start …. Time Now (buffer head) Manager Job Job Buffer tail pos pos Controller Kill 2 Report Manage in realtime One Replay Batch One Buffer One Buffer One BufferJobs Jobs Jobs Replay at a scale 1 01 myself+0 ”Streaming Algorithms for Big Data Processing on Multicore” Big Data: Algorithms, Analytics, and Applications, CRC (2015) 04 myself+0 ”Volume and Irregularity Effects on Massively Multicore Packet Processors” APNOMS (2016) M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 6/12 6/12
  • 7. Performance under hotspots M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 7/12 7/12
  • 8. The Hotspot Distribution 0 20 40 60 80 100 Decreasing order 0 0.35 0.7 1.05 1.4 1.75 2.1 2.45 2.8 log(value) Class A Class B Class C Class D Class E • models Flash/Hotspot/ Killerapp/Blackswan events using extreme variance in popularity • generation method: stick-breaking process, Dirichlet distribution with parallel beta sources 05 • final step: classify based on the number of hot/flash items 05 myself+1 ”Popularity-Based Modeling of Flash Events in Synthetic Packet Traces” CQ研 (2012) M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 8/12 8/12
  • 9. The Binary ”Till Contention” Metric • not a common, but very realistic way to model performance under load • note: even more applicable under hotspot-y input Rack Rack Border (switch) Client Data Shards Data Shards … Volume Contention Contention -free to contention -ful threshold • example: function of server response time to load can be expressed as: T = 1 2 [ (L − n) + √ (L − n)2 + k 1 − L ] • ...where T is response time, L is load, and k is the knee = contention point! M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 9/12 9/12
  • 10. Performance Models • shard size as S and in-job traffic to shard size ratio r ◦ so, Hadoop jobs generate rS versus always strictly S under Replay • contention threshold as C (for both contention and/or capacity) • list of shard hotness (popularity) { h1, h2, h3, ..., hn } and sizes{ S1, S2, S3, ...., Sn } • then we have (job/traffic) volume for Hadoop: Vhadoop = ∑ i=1..n rhiSi • ... and for Replay method: Vreplay = ∑ i=1..n Si (1) M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 10/12 10/12
  • 11. Results A 0.001 B 0.001 C 0.001 D 0.001 E 0.001 hadoopreplay A 0.005 B 0.005 C 0.005 D 0.005 E 0.005 A 0.01 B 0.01 C 0.01 D 0.01 E 0.01 A 0.05 B 0.05 C 0.05 D 0.05 E 0.05 A 0.1 B 0.1 C 0.1 D 0.1 E 0.1 A 0.2 B 0.2 C 0.2 D 0.2 E 0.2 10 20 50 100 200 500 1000 2000 5000 10000 0.7 1.4 2.1 2.8 3.5 4.2 log(1+timetillcontention) A 0.5 B 0.5 C 0.5 D 0.5 E 0.5 Replay period (step) is 10 A 0.001 B 0.001 C 0.001 D 0.001 E 0.001 hadoopreplay A 0.005 B 0.005 C 0.005 D 0.005 E 0.005 A 0.01 B 0.01 C 0.01 D 0.01 E 0.01 A 0.05 B 0.05 C 0.05 D 0.05 E 0.05 A 0.1 B 0.1 C 0.1 D 0.1 E 0.1 A 0.2 B 0.2 C 0.2 D 0.2 E 0.2 10 20 50 100 200 500 1000 2000 5000 10000 0.8 1.6 2.4 3.2 4 4.8 log(1+timetillcontention) A 0.5 B 0.5 C 0.5 D 0.5 E 0.5 Replay period (step) is 50 A 0.001 B 0.001 C 0.001 D 0.001 E 0.001 hadoopreplay A 0.005 B 0.005 C 0.005 D 0.005 E 0.005 A 0.01 B 0.01 C 0.01 D 0.01 E 0.01 A 0.05 B 0.05 C 0.05 D 0.05 E 0.05 A 0.1 B 0.1 C 0.1 D 0.1 E 0.1 A 0.2 B 0.2 C 0.2 D 0.2 E 0.2 10 20 50 100 200 500 1000 2000 5000 10000 0.9 1.8 2.7 3.6 4.5 5.4 log(1+timetillcontention) A 0.5 B 0.5 C 0.5 D 0.5 E 0.5 Replay period (step) is 200 M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 11/12 11/12
  • 12. That’s all, thank you ... M.Zhanikeev – maratishe@gmail.com On Performance Under Hotspots in Hadoop versus Bigdata Replay Platforms – bit.do/170920 12/12 12/12