Taming latency:
case studies in
MapReduce data
analytics
Simon Tao
EMC Labs China
Office of the CTO

© Copyright 2013 EMC Corporation. All rights reserved.

Roadmap Information Disclaimer
 EMC makes no representation and undertakes no obligations with
regard to product planning information, anticipated product
characteristics, performance specifications, or anticipated release
dates (collectively, “Roadmap Information”).
 Roadmap Information is provided by EMC as an accommodation to the
recipient solely for purposes of discussion and without intending to be
bound thereby.
 Roadmap Information is EMC Restricted Confidential and is provided
under the terms, conditions and restrictions defined in the EMC Non-Disclosure Agreement in place with your organization.

Agenda
 Motivation
 Combating latency in general
 Latency reducing approach for MR
 Case studies for MR with focus on low
latency
 Summary

Introduction
 What this presentation is about

– Approaches that improve performance by enhancing
the existing MapReduce platform
– Focus on per-job latency in wall-clock time, among other
performance metrics
– With case studies from both academia and industry

 What it is not about

– Performance improvement by manipulating MapReduce
framework tuning knobs

Low latency: Motivations
 Faster decision-making

– Fraud detection, system monitoring, trending topic
identification

 Interactivity

– Targeted advertising, personalized news feeds, Online
recommendations

 “pay-as-you-go” services

– Economic advantage in “pay-as-you-go” billing model

Sources of latency
“Computer science is a thousand layers of abstraction”
- One colleague of Shimon Schocken

 Latency is everywhere

– Hardware Infrastructure: Processors, Memory, Storage I/O, Network I/O
– Software Infrastructure: OS Kernel, JVM, Server software
– Architectural design and system implementation
– Communication protocol: DNS, TCP

“I see latency.... They're everywhere.”

Combating latency by extrapolation
 Approach to minimize latency for systems in general
– Address every latency bottleneck in the system
– Minimize its latency contribution

 Apply latency minimizing approach to MapReduce

– What are the layers in MapReduce data processing stack?
– How can the latency contributions from them be mitigated?

MapReduce Recap: logical view
 Map(k1,v1) → list(k2,v2)

– User defined Map function that processes a key/value pair
to generate a set of intermediate key/value pairs

 Reduce(k2, list(v2)) → list(v2)

– Reduce function merges all values associated with the
same intermediate key

 Simple, yet expressive

– Real world applications: Word Count, Distributed Grep,
Count of URL Access Frequency, Inverted Index, etc
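
The logical view above can be sketched in a few lines of Python. This is a single-process illustration, not Hadoop code: the names map_fn, reduce_fn and run_mapreduce are made up for this example, and an in-memory dictionary stands in for the shuffle.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a line."""
    return [(word, 1) for word in line.split()]


def reduce_fn(_word, counts):
    """Reduce(k2, list(v2)) -> list(v2): merge all values of one key."""
    return [sum(counts)]


def run_mapreduce(records, map_fn, reduce_fn):
    # Shuffle: group intermediate values by intermediate key.
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    # Reduce: each key group is processed independently of the others.
    return {k2: reduce_fn(k2, v2s) for k2, v2s in sorted(groups.items())}


lines = [(0, "a b a"), (1, "b c")]
word_counts = run_mapreduce(lines, map_fn, reduce_fn)
```

Each word in word_counts maps to a one-element list, mirroring the list(v2) result type of Reduce.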

MapReduce illustrated
[Figure: example data flow through the Map, Shuffle and Reduce stages]

MapReduce Recap: system view
 Embarrassingly parallel

– Partitioned parallelism in both Map and Reduce phases

 Distributed and scalable

– Computations distributed across a large cluster of
commodity machines
– Master schedules tasks to workers

 Fault tolerant

– Reschedule task in case of failure
– Materialize task output to disk

 Performance Optimized

– Combiner function
– Locality-aware scheduling
– Redundant execution
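
As an illustration of the combiner optimization listed above, the following sketch pre-aggregates map output locally before it would be shuffled. It is a toy example rather than framework code; map_words and combine are hypothetical names. A combiner of this kind is only safe when the reduce operation is commutative and associative, as summation is.

```python
from collections import Counter


def map_words(line):
    """Emit one (word, 1) pair per word, as a word-count mapper would."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Locally sum the counts per word so fewer pairs cross the network."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


raw = map_words("to be or not to be")   # six intermediate pairs
combined = combine(raw)                 # four pairs after combining
```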

MR latency mitigation: a systematic way
 Latency improvement opportunities exist across
the whole MR processing stack
– Architectural design
▪ HOP

– Programming model
▪ S4

– Resource scheduling
▪ Delay scheduling

– Dataflow: processing and
transmission

▪ Spark, Tenzing, Bolt MR, etc

– Data persistence
▪ Stinger

Trade-offs

Every good quality is noxious if unmixed
— Ralph Waldo Emerson

 Latency, sometimes at odds with throughput
– Speculative execution

▪ Backup executions of “straggler” tasks decrease per-job
latency at the expense of cluster throughput

 Trade-off between latency and fault tolerance
– Naïve pipelining

▪ Direct output transmission from Mapper to Reducer
alleviates latency bottleneck, but hurts fault tolerance

 Need to preserve other critical system
characteristics
– Throughput, fault tolerance, scalability…

Case Studies
Approaches to mitigating latency,
from HOP, Tenzing, S4, Spark,
Stinger and LUMOS

HOP: Hadoop Online Prototype
 A pipelining version of Hadoop from UC Berkeley
– “MapReduce Online”, NSDI'10 paper
– Open sourced with Apache License 2.0

 In HOP’s modified MapReduce architecture,
intermediate data is pipelined between operators
 HOP preserves the programming interfaces and fault
tolerance models of previous MapReduce
frameworks

Stock Hadoop: a blocking architecture
 Intermediate data produced
by each Mapper is pulled by
Reducer in its entirety
– Simplified fault tolerance

▪ Output data is materialized
before consumption

– Underutilized resources
▪ Completely decoupled execution
between Mapper and Reducer

HOP: from blocking to pipelining
 HOP offers a modified MapReduce architecture that
allows data to be pipelined between operators

– Improved system utilization and reduced completion times
with increased parallelism
– Extends programming model beyond batch processing
▪ Online aggregation
— Allows users to see “early returns” from a job as it is being computed

▪ Continuous queries
— Enable applications such as event monitoring and stream processing

Reducing latency in HOP
 Challenge: Latency backfire

– Increased job response time resulting from eager pipelining
▪ Eager pipelining prevents use of “combiner” optimization
▪ Reducer may be overloaded by shifted sorting work from Mappers

 Solution: Adaptive load moving

1. Buffer the output, with a threshold size in Mapper
2. On filled buffer, apply combiner function, sort and spill
output to disk
3. Spill files are pipelined to reduce tasks adaptively
▪ Accumulated spill files may be further merged
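
The three steps above can be sketched as follows. This is a minimal single-process model, not HOP code: BufferingMapper, its threshold parameter and merge_spills are hypothetical names, in-memory lists stand in for spill files on disk, and the adaptive decision of when to send spills to reducers is not modeled.

```python
from collections import Counter


class BufferingMapper:
    """Toy mapper that buffers output and spills sorted, combined runs."""

    def __init__(self, threshold):
        self.threshold = threshold  # buffered records that trigger a spill
        self.buffer = []
        self.spills = []            # each entry stands in for one spill file

    def emit(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.threshold:
            self.spill()

    def spill(self):
        if not self.buffer:
            return
        combined = Counter()
        for key, value in self.buffer:   # combiner: sum the values per key
            combined[key] += value
        self.spills.append(sorted(combined.items()))  # sort, then "write"
        self.buffer = []


def merge_spills(spills):
    """Merge accumulated spill runs into one combined, sorted run."""
    merged = Counter()
    for run in spills:
        for key, value in run:
            merged[key] += value
    return sorted(merged.items())


mapper = BufferingMapper(threshold=4)
for word in "a b a c b a d a".split():
    mapper.emit(word, 1)
mapper.spill()  # flush anything left at end of input
```

Because every spill is already combined and sorted, the receiving side gets less data and less sorting work than under eager record-at-a-time pipelining.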

Preserving fault tolerance in HOP
 Challenges:

– Reducer failure

▪ Makes fault tolerance difficult in a purely pipelined architecture

– Mapper failure

▪ Limits the reducer’s ability to merge spill files

 Solution:

– Materialization

▪ The intermediate data is materialized, retaining Hadoop's fault tolerance

– Checkpointing

▪ The offset reached in the Mapper's input split is recorded
▪ Only Mapper output produced before the offset is merged by the Reducer
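
The checkpointing rule above can be illustrated with a small filter. This is a sketch under assumed data structures: the dictionary layout, the end_offset field and the function name committed_spills are invented for the example and are not taken from HOP.

```python
def committed_spills(spills, checkpoint_offset):
    """Keep only the spills whose input lies at or before the checkpoint."""
    return [s for s in spills if s["end_offset"] <= checkpoint_offset]


# Hypothetical spills received from one mapper, tagged with the input
# offset the mapper had reached when each spill was produced.
spills = [
    {"end_offset": 100, "records": [("a", 2)]},
    {"end_offset": 200, "records": [("b", 1)]},
    {"end_offset": 300, "records": [("a", 1)]},  # after the last checkpoint
]

safe_to_merge = committed_spills(spills, checkpoint_offset=200)
```

If the mapper fails, the last spill can be discarded and regenerated by a restarted mapper, while the merged output of the first two remains valid.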

Performance evaluation from HOP
 Initial performance results show that pipelining can
reduce job completion times by up to 25% in some scenarios
– Word-count on 10GB input data, 20 map tasks and 20 reduce tasks
– CDF of Map and Reduce task completion times for Blocking and
Pipelining, respectively
– Pipelining reduces total job runtimes by 19.7%

Tenzing: Hive the Google way
 SQL query engine on top of MapReduce for ad hoc
data analysis from Google
– “Tenzing: A SQL Implementation On The MapReduce
Framework”, VLDB'11 paper
– Features:
▪ Strong SQL support
▪ Low latency, comparable with parallel databases
▪ Highly scalable and reliable, atop MapReduce
▪ Support for heterogeneous backend storage

Low latency approaches in Tenzing
 MR execution enhancement
– Process pool

▪ Master pool
▪ Worker pool

– Streaming and In-memory Chaining
– Sort Avoidance for certain hash based operators
▪ Block Shuffle

– Local Execution

 SQL Query enhancement

– Metadata-aware query plan optimization
– Projection and Filtering, Aggregation, Joins, etc

 Experimental Query Engine optimization
– LLVM query engine
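
To illustrate why sort avoidance helps hash-based operators, the sketch below computes the same grouped sum twice: once by sorting first, as a stock MapReduce shuffle would, and once with a hash table in a single pass. It is generic Python, not Tenzing code; the function names and sample data are made up.

```python
from itertools import groupby
from operator import itemgetter


def sort_based_sum(pairs):
    """Group by key after a full sort (O(n log n))."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


def hash_based_sum(pairs):
    """Group by key with a hash table in one pass, no sort needed."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


pairs = [("us", 3), ("cn", 5), ("us", 4), ("de", 1), ("cn", 2)]
```

Both functions return the same totals; only the hash-based one avoids the sort, which is the property a hash aggregation or hash join can exploit.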

Tenzing performance
 “Using this approach, we were able to bring down
the latency of the execution of a Tenzing query itself
to around 7 seconds.”
 “There are other bottlenecks in the system however,
such as computation of map splits, updating the
metadata service, …, etc. which means the typical
latency varies between 10 and 20 seconds currently.”

S4: Simple Scalable Streaming System
 A research project for stream processing at Yahoo!

– Open sourced in Sep 2009 and entered Apache incubation in
Oct 2011
– A general-purpose stream processing engine
▪ With a simple programming interface
▪ Distributed and scalable
▪ Partially fault-tolerant

– Designed for use cases that differ from batch processing
▪ Infinite data streams
▪ Streams of events that flow into the system at varying data rates
▪ Real-time processing with low latency expected

S4 overview
 Data abstraction

– Data are streams of key-value,
dispatched and processed by
Processing Elements

 Design inspired by
– Actors model
– MapReduce model

▪ key-value based data dispatching
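
A minimal sketch of key-based dispatching to Processing Elements follows. It is not the S4 API: CounterPE, Dispatcher and top_k are hypothetical names, everything runs in one process, and state is held only in memory, which is also why a crash would lose it.

```python
class CounterPE:
    """Toy processing element holding in-memory state for a single key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        self.count += 1


class Dispatcher:
    """Routes each keyed event to the PE instance that owns that key."""

    def __init__(self, pe_class):
        self.pe_class = pe_class
        self.pes = {}  # one PE instance per distinct key value

    def dispatch(self, key, event):
        pe = self.pes.get(key)
        if pe is None:
            pe = self.pes[key] = self.pe_class(key)
        pe.process(event)


def top_k(dispatcher, k):
    """Return the k most frequent keys seen so far, with their counts."""
    ranked = sorted(dispatcher.pes.values(), key=lambda pe: (-pe.count, pe.key))
    return [(pe.key, pe.count) for pe in ranked[:k]]


dispatcher = Dispatcher(CounterPE)
for word in ["x", "y", "x", "z", "x", "y"]:
    dispatcher.dispatch(word, {"word": word})
```

Calling top_k(dispatcher, 2) at any point returns the current leaders, so results are available while the stream is still flowing.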

[Figure: TopK, stream processing]

Low latency design in S4
 Simple programming paradigm that operates on
data streams in real-time
 Minimize latency by using local memory to avoid
disk I/O bottlenecks
– Lossy failover: Partially fault tolerant

 Pluggable architecture to select network protocol for
data communication
– Communication layer allows data to be sent without a
delivery guarantee, trading reliability for performance

Spark
 Research project at UC Berkeley on big data analytics

– “Spark: Cluster Computing with Working Sets”, HotCloud'10

 A parallel cluster computing framework
– Supports applications with working sets
▪ Iterative algorithm
▪ Interactive data analysis

– Retains the scalability and fault tolerance of MapReduce

 Allows efficient interactive analysis of large data sets on clusters,
using a general-purpose programming language

Reducing latency in Spark
 In Spark, data can be cached in memory
explicitly

– The core data abstraction for Spark is the RDD (Resilient Distributed Dataset), a read-only, partitioned collection of objects

 Keeping working set of data in memory can
improve performance by an order of magnitude
– Outperforms Hadoop by 20× for iterative jobs
– Can be used interactively to search a 1 TB dataset
with latencies of 5–7 seconds
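
The effect of keeping a working set in memory can be shown with a toy model. This is plain Python and deliberately not the Spark API: Dataset, its cache and collect methods, and the loads counter are assumptions made for the example, with the loader function standing in for an expensive read from disk.

```python
class Dataset:
    """Toy stand-in for a cacheable dataset; not the Spark API."""

    def __init__(self, loader):
        self.loader = loader     # expensive function that produces the data
        self.loads = 0           # how many times the loader actually ran
        self._cached = None
        self._use_cache = False

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached  # served from memory
        self.loads += 1
        data = self.loader()
        if self._use_cache:
            self._cached = data
        return data


def iterate(dataset, rounds):
    """An iterative job that scans the same working set every round."""
    total = 0
    for _ in range(rounds):
        total += sum(dataset.collect())
    return total


uncached = Dataset(lambda: list(range(5)))
cached = Dataset(lambda: list(range(5))).cache()
uncached_total = iterate(uncached, rounds=10)
cached_total = iterate(cached, rounds=10)
```

Both runs compute the same total, but the uncached dataset is reloaded on every round while the cached one is loaded once; that repeated reload is the cost an iterative job pays on a disk-based pipeline.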

Fault tolerance in Spark
 Lineage

– Lost partitions are recovered by
‘replaying’ the series of
transformations used to build the
RDD

 Checkpointing

– To avoid time-consuming recovery,
checkpointing to stable storage
helps applications with
▪ Long lineage graph
▪ Lineage composed of
wide dependencies
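
Lineage-based recovery can be sketched with a dataset that remembers its parent and the transformation that produced it. Again this is an illustrative model, not Spark code; LineageDataset and its fields are hypothetical, and a single in-memory list stands in for a partition.

```python
class LineageDataset:
    """Toy dataset that records how it was derived; not the Spark API."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source    # base data, set only on the lineage root
        self.parent = parent    # dataset this one was derived from
        self.fn = fn            # transformation applied to the parent
        self.partition = None   # materialized data; None means "lost"

    def map(self, fn):
        return LineageDataset(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return LineageDataset(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def compute(self):
        if self.partition is None:  # rebuild on demand by replaying lineage
            if self.parent is None:
                self.partition = list(self.source)
            else:
                self.partition = self.fn(self.parent.compute())
        return self.partition


base = LineageDataset(source=[1, 2, 3, 4])
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
first = even_squares.compute()
even_squares.partition = None       # simulate losing the partition
recovered = even_squares.compute()  # replayed from the recorded lineage
```

The longer the chain of parents, the more work a replay costs, which is the case where checkpointing to stable storage pays off.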

Stinger Initiative
 Enhance Hive with more SQL and improved
performance to allow human-time use cases

– Announced in Feb 2013, led by Hortonworks
– Effort from community collaboration, with resources from
SAP, Microsoft, Facebook and Hortonworks

Making Apache Hive 100 Times Faster
 Stinger’s improvements to Hive
– More SQL

▪ Analytics features, standard SQL alignment, etc

– Optimized query execution plans
▪ 45X performance increase for Hive in some early results

– Support of new columnar file format
▪ ORCFile, for better efficiency and higher performance

– New runtime framework, Apache Tez

Accelerating data processing by Tez
 In traditional MapReduce, one SQL query
often results in multiple jobs, which
hurts performance

– Latency introduced by launching each job
– Extra overhead in materializing intermediate
job outputs to the file system

 Performance improvements from Tez

– With a generalized computing paradigm for
DAG execution, Tez can express any SQL query as a
single job
– Tez AM, running atop YARN, supports container
reuse
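
A back-of-the-envelope model makes the two overheads above concrete. The cost model and all numbers are hypothetical and chosen only for illustration; real launch and materialization costs depend on the cluster and the query.

```python
def chained_jobs_latency(stage_times, launch_overhead, materialize_cost):
    """Each stage is its own MR job: pay launch and file-system write per job."""
    return sum(launch_overhead + t + materialize_cost for t in stage_times)


def single_dag_latency(stage_times, launch_overhead, materialize_cost):
    """All stages run inside one DAG job: one launch, one final write."""
    return launch_overhead + sum(stage_times) + materialize_cost


# Hypothetical query that compiles to three stages of 10 time units each.
stages = [10, 10, 10]
chained = chained_jobs_latency(stages, launch_overhead=5, materialize_cost=3)
as_dag = single_dag_latency(stages, launch_overhead=5, materialize_cost=3)
```

Under this simplified model the saving grows linearly with the number of stages, since every stage boundary removed saves one launch and one materialization.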

LUMOS Project
 A real-time, interactive, self-service
data cloud platform for big data
analytics, from EMC Labs China
 LUMOS – guide the data scientists to
the big value of big data
 Goal: Develop key building blocks for
the big data cloud platform

Design principles
 Real-time analytics

– Low latency MapReduce data processing

 Interactive analytics

– SQL query interface and visualization

 Deep analytics

– Advanced and complex statistical analysis and data mining
– Predictive analytics

 Self-service analytics

– Analytics as a service

Building Blocks in LUMOS
 Data Process

– BoltMR: Flexible and High Performance MapReduce
execution engine

 Data Access

– SQL2MR: Declarative query interface and optimizer for
MapReduce

 Data Service

– DMaaS: Data mining analytics service and tools

Bolt MR
 A flexible, low-latency and high
performance MapReduce
implementation

– Improve the overall performance
– Reduce latency
– Support for alternative workload
types
▪ Iterative
▪ Incremental
▪ Online Aggregation and Continuous Query
Flickr credit: http://www.flickr.com/photos/blahflowers/4656725185/

Bolt MR – latency enhancement
 Batch mode MapReduce
with enhancement on
Hadoop:
– Enhanced task resource
allocation
– Master/Worker Pool
– Flexible data
processing/transmission
options

Bolt MR – Performance evaluation
• On Container Reuse and Worker Pool
• Lower latency is observed in all the conducted micro-benchmarks
• For the jobs with small input, substantial improvement ratio is observed
[Figure: Job execution time (s) for Job3; legend: Reuse + Pool, Worker Pool, Container Reuse, Normal; bar labels: 32, 63, 209, 242]
[Figure: two per-task charts of TaskInitializingTime and TaskProcessingTime]

SQL2MR
 Problems

– Poor programmability and metadata management of MapReduce
▪ MapReduce applications are hard to program
▪ Need to publish data in well-known schemas

– Poor performance of existing MapReduce query translation
systems (e.g., Hive, Pig)

▪ Inefficiency (latency on the order of minutes) of sub-optimal MR jobs due
to limited query optimization ability
▪ Poor SQL compatibility and limited language expressive power

 Our solution

– An extensible and powerful SQL-like query language for
complex analytics
– Cost-based query execution plan optimization for MR

Query Optimization for MapReduce-Based Big Data Analytics
 A novel cost-based optimization framework that
– Learns from the wide spectrum of DB query optimization (>40 years!!)
– Exploits usage & design properties of MapReduce frameworks
 Diagram flow: SQL Queries → SQL Query Processor → Optimal MR jobs (J1 to J4)
– Query Parsing
– Plan Space Exploration: enumerate the alternative physical plans (i.e., MR jobs) for the input query
– Cost Estimation: estimate the execution costs of physical plans and select the cheapest one
– Schema Info & Statistics Maintenance: store and derive the logical and physical properties of both input and intermediate data
 Efficient MapReduce jobs non-invasively running at existing and future Hadoop stacks
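
The enumerate, estimate and select loop of a cost-based optimizer can be sketched as below. This is not the SQL2MR optimizer: the Plan fields, the plan names, the cost formula and its coefficients are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A candidate physical plan, i.e. a particular set of MR jobs."""
    name: str
    num_jobs: int
    shuffled_bytes: int


def estimate_cost(plan, job_startup_cost=10.0, cost_per_shuffled_mb=0.5):
    # Hypothetical cost model: fixed cost per MR job plus shuffle volume.
    return plan.num_jobs * job_startup_cost + (plan.shuffled_bytes / 1e6) * cost_per_shuffled_mb


def choose_plan(plans):
    """Select the plan with the lowest estimated cost."""
    return min(plans, key=estimate_cost)


# Hypothetical alternatives enumerated for one input query.
plans = [
    Plan("two-job repartition join", num_jobs=2, shuffled_bytes=400_000_000),
    Plan("single-job map-side join", num_jobs=1, shuffled_bytes=50_000_000),
    Plan("three-job naive plan", num_jobs=3, shuffled_bytes=900_000_000),
]
best = choose_plan(plans)
```

The quality of the choice depends entirely on the statistics fed into the cost model, which is why schema information and statistics maintenance appear as a separate component.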

Optimizations from other
research/engineering efforts
 Delay Scheduling

– A scheduler that takes into account both fairness and data locality

 Longest Approximate Time to End, LATE

– Speculatively executes tasks based on estimated finish time
– Launches speculative tasks on fast nodes

 Direct I/O

– Read data from local disk if applicable, avoiding inter-process communication costs
from HDFS

 Low level optimizations

– OS level: Efficient data transfer with sendfile system call
– Instruction level: Increased HDFS read/write efficiency via CRC32 support from
SSE4.2 instruction extensions in Intel Nehalem processor
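
The finish-time estimate behind LATE can be written down directly: a task's progress rate is its progress score divided by its running time, and its estimated time to end is the remaining progress divided by that rate. The sketch below applies this to made-up task records; the dictionary layout and function names are assumptions, and LATE's additional rules (such as the cap on concurrent speculative tasks and the fast-node check) are not modeled.

```python
def estimated_time_left(progress, elapsed):
    """Remaining progress divided by the observed progress rate."""
    rate = progress / elapsed  # progress score gained per second
    return (1.0 - progress) / rate


def pick_speculation_candidate(tasks):
    """Return the running task expected to finish farthest in the future."""
    running = [t for t in tasks if 0.0 < t["progress"] < 1.0]
    return max(running, key=lambda t: estimated_time_left(t["progress"], t["elapsed"]))


# Hypothetical running tasks: progress score in (0, 1), elapsed in seconds.
tasks = [
    {"id": "m1", "progress": 0.9, "elapsed": 90.0},
    {"id": "m2", "progress": 0.2, "elapsed": 100.0},
    {"id": "m3", "progress": 0.5, "elapsed": 50.0},
]
candidate = pick_speculation_candidate(tasks)
```

Ranking by time left rather than by raw progress matters: a task that started late can have low progress without being slow.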

Quick summary
• Latency improvement: optimization across all layers in a MapReduce system
– Query engine
▪ SQL query optimization (Tenzing, Stinger, SQL2MR)
▪ Code generation (Tenzing)
– Architectural design
▪ Pipelining (HOP)
– Programming model
▪ Streaming (S4)
– Resource scheduling
▪ Scheduling algorithm optimization (Delay Scheduling, LATE)
– Data processing and transmission
▪ In-Memory (S4, Spark), Process Pool (Tenzing, Bolt MR), Sort Avoidance
(Tenzing), more efficient system calls, etc
– Data persistence
▪ Columnar storage (Stinger), Direct I/O
3 Ways to Cope with Latency Lags Bandwidth

 “3 Ways to Cope with Latency Lags Bandwidth”, from David
Patterson
– Caching

▪ Processor caches, file cache, disk cache

– Replication

▪ Multiple requests to multiple copies and
just use the quickest reply

– Prediction

▪ Branches + Prefetching

 Corresponding latency-reducing approaches in MapReduce
– In-memory cache in Spark
– Speculative execution in MapReduce
– Pipelining in HOP

Are We There Yet?
 Identifying performance bottlenecks is
an iterative process
– Mitigating one bottleneck is often
followed by the discovery of the
next one
– “These 3 already fully deployed, so must
find next set of tricks to cope; hard!”
- David Patterson

Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityEMC
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeEMC
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015EMC
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesEMC
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsEMC
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookEMC
 

More from EMC (20)

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremio
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis Openstack
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop Elsewhere
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical Review
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or Foe
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for Security
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure Age
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
 

Recently uploaded

2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Recently uploaded (20)

2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

Taming Latency: Case Studies in MapReduce Data Analytics

  • 1. Taming latency: case studies in MapReduce data analytics
      Simon Tao, EMC Labs China, Office of the CTO
      © Copyright 2013 EMC Corporation. All rights reserved.
  • 2. Roadmap Information Disclaimer
      • EMC makes no representation and undertakes no obligations with regard to product planning information, anticipated product characteristics, performance specifications, or anticipated release dates (collectively, “Roadmap Information”).
      • Roadmap Information is provided by EMC as an accommodation to the recipient solely for purposes of discussion and without intending to be bound thereby.
      • Roadmap Information is EMC Restricted Confidential and is provided under the terms, conditions and restrictions defined in the EMC Non-Disclosure Agreement in place with your organization.
  • 3. Agenda
      • Motivation
      • Combating latency in general
      • Latency-reducing approaches for MR
      • Case studies for MR with a focus on low latency
      • Summary
  • 4. Introduction
      • What this presentation is about
          – Approaches that improve performance by enhancing existing MapReduce platforms
          – A focus on per-job latency in wall-clock time, among other performance metrics
          – Case studies from both academia and industry
      • What it is not about
          – Performance improvement by manipulating MapReduce framework tuning knobs
  • 5. Low latency: Motivations
      • Faster decision-making
          – Fraud detection, system monitoring, trending topic identification
      • Interactivity
          – Targeted advertising, personalized news feeds, online recommendations
      • “Pay-as-you-go” services
          – Economic advantage in the “pay-as-you-go” billing model
  • 6. Sources of latency
      “Computer science is a thousand layers of abstraction” (a colleague of Shimon Schocken)
      • Latency is everywhere
          – Hardware infrastructure: processors, memory, storage I/O, network I/O
          – Software infrastructure: OS kernel, JVM, server software
          – Architectural design and system implementation
          – Communication protocols: DNS, TCP
      [Image caption: “I see latency... They're everywhere.”]
  • 7. Combating latency by extrapolation
      • Approach to minimize latency for systems in general
          – Address every latency bottleneck in the system
          – Minimize its latency contribution
      • Apply the latency-minimizing approach to MapReduce
          – What are the layers in the MapReduce data processing stack?
          – How can the latency contributions from them be mitigated?
  • 8. MapReduce Recap: logical view
      • Map(k1, v1) → list(k2, v2)
          – User-defined Map function that processes a key/value pair to generate a set of intermediate key/value pairs
      • Reduce(k2, list(v2)) → list(v2)
          – Reduce function merges all values associated with the same intermediate key
      • Simple, yet expressive
          – Real-world applications: Word Count, Distributed Grep, Count of URL Access Frequency, Inverted Index, etc.
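The two signatures above can be made concrete with a small, single-process sketch of Word Count. It only illustrates the logical view: `run_mapreduce` is a hypothetical driver written for this example, not a Hadoop API, and the in-memory dictionary stands in for the shuffle.

```python
from collections import defaultdict


def map_word_count(doc_id, text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    for word in text.lower().split():
        yield word, 1


def reduce_word_count(word, counts):
    """Reduce(k2, list(v2)) -> list(v2): merge all counts for one word."""
    yield sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process driver (not a real framework API)."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)  # "shuffle": group values by intermediate key
    # Reduce each key group; keys are visited in sorted order, as in MapReduce.
    return {k2: list(reduce_fn(k2, vs)) for k2, vs in sorted(groups.items())}


docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog"), ("d3", "the fox")]
result = run_mapreduce(docs, map_word_count, reduce_word_count)
print(result)
```

In a real deployment the map calls, the grouping, and the reduce calls run on different machines; the sketch keeps only the data contract between them.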
  • 9. MapReduce illustrated
      [Figure: records flowing through the Map, Shuffle, and Reduce phases]
  • 10. MapReduce Recap: system view
      • Embarrassingly parallel
          – Partitioned parallelism in both Map and Reduce phases
      • Distributed and scalable
          – Computations distributed across a large cluster of commodity machines
          – Master schedules tasks to workers
      • Fault tolerant
          – Reschedule tasks in case of failure
          – Materialize task output to disk
      • Performance optimized
          – Combiner function
          – Locality-aware scheduling
          – Redundant execution
  • 11. MR latency mitigation: a systematic way
      • Latency improvement opportunities exist across the whole MR processing stack
          – Architectural design
              ▪ HOP
          – Programming model
              ▪ S4
          – Resource scheduling
              ▪ Delay scheduling
          – Dataflow: processing and transmission
              ▪ Spark, Tenzing, Bolt MR, etc.
          – Data persistence
              ▪ Stinger
  • 12. Trade-offs
      “Every good quality is noxious if unmixed” (Ralph Waldo Emerson)
      • Latency is sometimes at odds with throughput
          – Speculative execution
              ▪ Backup executions of “straggler” tasks decrease per-job latency at the expense of cluster throughput
      • Trade-off between latency and fault tolerance
          – Naïve pipelining
              ▪ Direct output transmission from Mapper to Reducer alleviates the latency bottleneck, but hurts fault tolerance
      • Need to preserve other critical system characteristics
          – Throughput, fault tolerance, scalability…
  • 13. Case Studies
      Approaches to mitigating latency in HOP, Tenzing, S4, Spark, Stinger and LUMOS
  • 14. HOP: Hadoop Online Prototype
      • A pipelining version of Hadoop from UC Berkeley
          – “MapReduce Online”, NSDI'10 paper
          – Open sourced under Apache License 2.0
      • In HOP’s modified MapReduce architecture, intermediate data is pipelined between operators
      • HOP preserves the programming interfaces and fault tolerance models of previous MapReduce frameworks
  • 15. Stock Hadoop: a blocking architecture
      • Intermediate data produced by each Mapper is pulled by the Reducer in its entirety
          – Simplified fault tolerance
              ▪ Output data is materialized before consumption
          – Underutilized resources
              ▪ Completely decoupled execution between Mapper and Reducer
  • 16. HOP: from blocking to pipelining
      • HOP offers a modified MapReduce architecture that allows data to be pipelined between operators
          – Improved system utilization and reduced completion times with increased parallelism
          – Extends the programming model beyond batch processing
              ▪ Online aggregation: allows users to see “early returns” from a job as it is being computed
              ▪ Continuous queries: enable applications such as event monitoring and stream processing
  • 17. Reducing latency in HOP
      • Challenge: latency backfire
          – Increased job response time resulting from eager pipelining
              ▪ Eager pipelining prevents use of the “combiner” optimization
              ▪ Reducers may be overloaded by sorting work shifted from Mappers
      • Solution: adaptive load moving
          1. Buffer the output in the Mapper, up to a threshold size
          2. When the buffer fills, apply the combiner function, sort, and spill the output to disk
          3. Spill files are pipelined to reduce tasks adaptively
              ▪ Accumulated spill files may be further merged
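Steps 1 and 2 of the solution above can be sketched as a map-side buffer. This is a minimal illustration under stated assumptions, not HOP's implementation: `SpillingMapBuffer` is a hypothetical name, the threshold counts records rather than bytes, and in-memory lists stand in for spill files on disk.

```python
from collections import defaultdict


class SpillingMapBuffer:
    """Hypothetical map-side buffer: combine, sort, and spill when full."""

    def __init__(self, threshold, combiner):
        self.threshold = threshold  # max buffered records before a spill
        self.combiner = combiner    # function(key, values) -> combined value
        self.records = []           # buffered (key, value) pairs
        self.spills = []            # each spill is a sorted list of (key, value)

    def emit(self, key, value):
        self.records.append((key, value))
        if len(self.records) >= self.threshold:
            self.spill()

    def spill(self):
        if not self.records:
            return
        grouped = defaultdict(list)
        for key, value in self.records:
            grouped[key].append(value)
        # Apply the combiner per key, then sort so reducers can merge cheaply.
        combined = sorted((k, self.combiner(k, vs)) for k, vs in grouped.items())
        self.spills.append(combined)  # stand-in for writing a spill file
        self.records = []


buf = SpillingMapBuffer(threshold=4, combiner=lambda key, values: sum(values))
for word in "a b a c b a".split():
    buf.emit(word, 1)
buf.spill()  # flush whatever remains when the map task ends
print(buf.spills)
```

Because each spill is already combined and sorted, the reducer receives less data and only has to merge sorted runs, which is how buffering recovers the combiner benefit that eager per-record pipelining gives up.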
  • 18. Preserving fault tolerance in HOP
      • Challenges
          – Reducer failure
              ▪ Makes fault tolerance difficult in a purely pipelined architecture
          – Mapper failure
              ▪ Limits the Reducer’s ability to merge spill files
      • Solutions
          – Materialization
              ▪ Intermediate data is materialized, retaining Hadoop's fault tolerance
          – Checkpointing
              ▪ The offset reached in the Mapper's input split is recorded
              ▪ Only Mapper output produced before that offset is merged by the Reducer
  • 19. Performance evaluation from HOP
      • Initial performance results show that pipelining can reduce job completion times by up to 25% in some scenarios
          – Word count on 10 GB of input data, with 20 map tasks and 20 reduce tasks
          – CDF of Map and Reduce task completion times for blocking and pipelining, respectively
          – Pipelining reduces total job runtime by 19.7%
  • 20. Tenzing: Hive the Google way
      • SQL query engine from Google, built on top of MapReduce for ad hoc data analysis
          – “Tenzing A SQL Implementation On The MapReduce Framework”, VLDB'11 paper
          – Features:
              ▪ Strong SQL support
              ▪ Low latency, comparable with parallel databases
              ▪ Highly scalable and reliable, atop MapReduce
              ▪ Support for heterogeneous backend storage
  • 21. Low latency approaches in Tenzing
      • MR execution enhancement
          – Process pool
              ▪ Master pool
              ▪ Worker pool
          – Streaming and in-memory chaining
          – Sort avoidance for certain hash-based operators
              ▪ Block shuffle
          – Local execution
      • SQL query enhancement
          – Metadata-aware query plan optimization
          – Projection and filtering, aggregation, joins, etc.
      • Experimental query engine optimization
          – LLVM query engine
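The idea behind sort avoidance can be illustrated by comparing two ways of computing a grouped sum. This is a generic sketch of sort-based versus hash-based aggregation, not Tenzing's code; both function names are made up for the example.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def sort_based_sum(pairs):
    """Classic MapReduce path: sort by key, then aggregate each run of equal keys."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {k: sum(v for _, v in run) for k, run in groupby(ordered, key=itemgetter(0))}


def hash_based_sum(pairs):
    """Sort-avoiding path: aggregate in a hash table in a single pass."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


pairs = [("us", 3), ("cn", 5), ("us", 4), ("de", 1), ("cn", 2)]
print(sort_based_sum(pairs))
print(hash_based_sum(pairs))
```

Both functions return the same totals, but the hash-based version never pays for the O(n log n) sort, which is the cost a hash aggregation or hash join does not need the framework to incur.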
  • 22. Tenzing performance
      • “Using this approach, we were able to bring down the latency of the execution of a Tenzing query itself to around 7 seconds.”
      • “There are other bottlenecks in the system however, such as computation of map splits, updating the metadata service, …, etc. which means the typical latency varies between 10 and 20 seconds currently.”
  • 23. S4: Simple Scalable Streaming System
      • A research project for stream processing at Yahoo!
          – Open sourced in Sep 2009 and entered Apache Incubation in Oct 2011
          – A general-purpose stream processing engine
              ▪ With a simple programming interface
              ▪ Distributed and scalable
              ▪ Partially fault-tolerant
          – Designed for use cases different from batch processing
              ▪ Infinite data streams
              ▪ Streams of events that flow into the system at varying data rates
              ▪ Real-time processing with low latency expected
  • 24. S4 overview
      • Data abstraction
          – Data are streams of key-value pairs, dispatched to and processed by Processing Elements
      • Design inspired by
          – The Actors model
          – The MapReduce model
              ▪ Key-value based data dispatching
      [Figure: TopK, stream processing]
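Key-based dispatching to Processing Elements can be sketched as follows, using the TopK example named in the figure. This is a toy, single-process illustration: `CountingPE` and `Dispatcher` are hypothetical names, and the sketch assumes one in-memory element per distinct key rather than reproducing the S4 API.

```python
class CountingPE:
    """Hypothetical processing element: counts the events seen for one key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, value):
        self.count += 1


class Dispatcher:
    """Routes each (key, value) event to the element that owns that key."""

    def __init__(self, pe_factory):
        self.pe_factory = pe_factory
        self.instances = {}  # key -> processing element, kept in local memory

    def dispatch(self, key, value):
        pe = self.instances.get(key)
        if pe is None:  # first event for this key: create its element lazily
            pe = self.instances[key] = self.pe_factory(key)
        pe.process(value)

    def top_k(self, k):
        ranked = sorted(self.instances.values(), key=lambda pe: (-pe.count, pe.key))
        return [(pe.key, pe.count) for pe in ranked[:k]]


dispatcher = Dispatcher(CountingPE)
for tag in ["#ai", "#db", "#ai", "#ml", "#ai", "#db"]:
    dispatcher.dispatch(tag, 1)
top2 = dispatcher.top_k(2)
print(top2)
```

State lives only in memory and is updated per event, so results are available as the stream flows rather than after a batch completes.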
  • 25. Low latency design in S4
      • Simple programming paradigm that operates on data streams in real time
      • Minimizes latency by using local memory to avoid disk I/O bottlenecks
          – Lossy failover: partially fault tolerant
      • Pluggable architecture for selecting the network protocol used for data communication
          – The communication layer allows data to be sent without delivery guarantees in exchange for performance
  • 26. Spark
      • Research project at UC Berkeley on big data analytics
          – “Spark: Cluster Computing with Working Sets”, HotCloud'10
      • A parallel cluster computing framework
          – Supports applications with working sets
              ▪ Iterative algorithms
              ▪ Interactive data analysis
          – Retains the scalability and fault tolerance of MapReduce
      • Allows efficient interactive analysis of large data sets on clusters, using a general-purpose programming language
  • 27. Reducing latency in Spark
      • In Spark, data can be cached in memory explicitly
          – The core data abstraction in Spark is the RDD, a read-only, partitioned collection of objects
      • Keeping the working set of data in memory can improve performance by an order of magnitude
          – Outperforms Hadoop by 20× for iterative jobs
          – Can be used interactively to search a 1 TB dataset with latencies of 5–7 seconds
  • 28. Fault tolerance in Spark
      • Lineage
          – Lost partitions are recovered by “replaying” the series of transformations used to build the RDD
      • Checkpointing
          – To avoid time-consuming recovery, checkpointing to stable storage helps applications with
              ▪ Long lineage graphs
              ▪ Lineage composed of wide dependencies
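Explicit caching and lineage-based recovery can be shown together in a toy model. `MiniRDD` below is a hypothetical stand-in written for this sketch; it is not the Spark API, holds a single partition, and runs in one process.

```python
class MiniRDD:
    """Toy stand-in for an RDD: a read-only dataset plus the recipe to rebuild it."""

    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data (only for the root of the lineage)
        self.parent = parent        # lineage: the dataset this one derives from
        self.transform = transform  # lineage: how to derive it from the parent
        self.keep_in_memory = False
        self.data = None            # cached rows, if any
        self.computations = 0       # how many times the rows were (re)built

    def map(self, fn):
        return MiniRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return MiniRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def cache(self):
        self.keep_in_memory = True  # caching is explicit, as on the slide
        return self

    def collect(self):
        if self.data is not None:
            return list(self.data)  # served from memory, no recomputation
        self.computations += 1
        if self.parent is None:
            rows = list(self.source)
        else:
            rows = self.transform(self.parent.collect())  # replay the lineage
        if self.keep_in_memory:
            self.data = rows
        return list(rows)

    def lose_partition(self):
        self.data = None  # simulate losing the cached copy


squares = MiniRDD(source=range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
first = squares.collect()   # computed once and cached
second = squares.collect()  # served from the cache
squares.lose_partition()
third = squares.collect()   # rebuilt by replaying filter and map
print(first, squares.computations)
```

The second call costs nothing beyond a copy, which is the latency benefit; the third call shows that losing the cached copy costs a recomputation but no data, which is why long or wide lineages make checkpointing attractive.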
  • 29. Stinger Initiative
      • Enhance Hive with more SQL and improved performance to allow human-time use cases
          – Announced in Feb 2013, led by Hortonworks
          – A community collaboration, with resources from SAP, Microsoft, Facebook and Hortonworks
  • 30. Making Apache Hive 100 Times Faster
      • Stinger’s improvements to Hive
          – More SQL
              ▪ Analytics features, standard SQL alignment, etc.
          – Optimized query execution plans
              ▪ 45X performance increase for Hive in some early results
          – Support for a new columnar file format
              ▪ ORCFile, for greater efficiency and higher performance
          – New runtime framework, Apache Tez
  • 31. Accelerating data processing with Tez
      • In traditional MapReduce, one SQL query often results in multiple jobs, which hurts performance
          – Latency introduced by launching jobs
          – Extra overhead in materializing intermediate job outputs to the file system
      • Performance improvements from Tez
          – With a generalized computing paradigm for DAG execution, Tez can express any SQL query as one single job
          – The Tez AM, running atop YARN, supports container reuse
  • 32. LUMOS Project
      • A real-time, interactive, self-service data cloud platform for big data analytics, from EMC Labs China
      • LUMOS: guide data scientists to the big value of big data
      • Goal: develop key building blocks for the big data cloud platform
  • 33. Design principles
      • Real-time analytics
          – Low latency MapReduce data processing
      • Interactive analytics
          – SQL query interface and visualization
      • Deep analytics
          – Advanced and complex statistics and data mining
          – Predictive analytics
      • Self-service analytics
          – Analytics as a service
  • 34. Building Blocks in LUMOS
      • Data Process
          – BoltMR: flexible and high performance MapReduce execution engine
      • Data Access
          – SQL2MR: declarative query interface and optimizer for MapReduce
      • Data Service
          – DMaaS: data mining analytics service and tools
  • 35. Bolt MR
      • A flexible, low-latency and high performance MapReduce implementation
          – Improves overall performance
          – Reduces latency
          – Supports alternative workload types
              ▪ Iterative
              ▪ Incremental
              ▪ Online aggregation and continuous query
      Flickr credit: http://www.flickr.com/photos/blahflowers/4656725185/
  • 36. Bolt MR – latency enhancement
      • Batch-mode MapReduce with enhancements on Hadoop:
          – Enhanced task resource allocation
          – Master/Worker pool
          – Flexible data processing/transmission options
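Why a worker pool lowers latency can be seen by counting startups instead of measuring time. The sketch below is illustrative only and does not reflect Bolt MR internals: `Worker`, `run_without_pool`, and `run_with_pool` are hypothetical names, and the startup counter stands in for the cost of launching and initializing a task process.

```python
class Worker:
    """Hypothetical worker; creating one stands in for launching a task process."""

    startups = 0  # class-wide count of simulated process launches

    def __init__(self):
        Worker.startups += 1

    def run(self, task):
        return task()


def run_without_pool(tasks):
    # One fresh worker per task: every task pays the startup cost.
    return [Worker().run(task) for task in tasks]


def run_with_pool(tasks, pool_size):
    # Workers are started once and reused round-robin across tasks.
    pool = [Worker() for _ in range(pool_size)]
    return [pool[i % pool_size].run(task) for i, task in enumerate(tasks)]


tasks = [lambda n=n: n * n for n in range(6)]

Worker.startups = 0
plain = run_without_pool(tasks)
cold_startups = Worker.startups

Worker.startups = 0
pooled = run_with_pool(tasks, pool_size=2)
pooled_startups = Worker.startups

print(cold_startups, pooled_startups)
```

The results are identical, but the pooled run pays for two startups instead of six. The saving matters most when tasks are short, since startup then dominates the task's total time.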
  • 37. Bolt MR – Performance evaluation
      • On container reuse and worker pool
      • Lower latency is observed in all the conducted micro-benchmarks
      • For jobs with small input, a substantial improvement ratio is observed
      [Chart: job execution time (s) for Job3 under four configurations (Reuse + Pool, Worker Pool, Container Reuse, Normal); extracted values: 32, 63, 209, 242]
      [Charts: TaskInitializingTime and TaskProcessingTime per task, two panels]
  • 38. SQL2MR
      • Problems
          – Poor programmability and metadata management in MapReduce
              ▪ MapReduce applications are hard to program
              ▪ Need to publish data in well-known schemas
          – Poor performance of existing MapReduce query translation systems (e.g., Hive, Pig)
              ▪ Inefficiency (latency on the order of minutes) of sub-optimal MR jobs due to limited query optimization ability
              ▪ Poor SQL compatibility and limited language expressiveness
      • Our solution
          – An extensible and powerful SQL-like query language for complex analytics
          – Cost-based query execution plan optimization for MR
  • 39. Query Optimization for MapReduce-Based Big Data Analytics  A novel cost-based optimization framework that – Learns from the wide spectrum of DB query optimization (>40 years!!) – Exploits usage & design properties of MapReduce frameworks  SQL query processor: SQL queries → optimal MR jobs – Query parsing – Plan space exploration: enumerate the alternative physical plans (i.e., MR jobs) for the input query – Cost estimation: estimate the execution costs of the physical plans and select the cheapest one – Schema info & statistics maintenance: store and derive the logical and physical properties of both input and intermediate data  Efficient MapReduce jobs that run non-invasively on existing and future Hadoop stacks
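The enumerate-then-estimate loop described on this slide can be sketched as follows. This is a minimal illustration, not SQL2MR's optimizer: it assumes left-deep join plans with one MR job per join, a single uniform join selectivity, and a cost equal to a fixed job startup charge plus the rows each job reads. The names `Table`, `plan_cost`, and `cheapest_plan`, and the default constants, are all hypothetical.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Table:
    name: str
    rows: int


def plan_cost(order, selectivity, job_startup_cost):
    """Estimated cost of a left-deep join plan, one MR job per join.

    Assumed cost model (illustrative only): each job costs a fixed
    startup charge plus the number of rows it reads from both inputs.
    """
    left_rows = order[0].rows
    cost = 0.0
    for right in order[1:]:
        cost += job_startup_cost + left_rows + right.rows
        # Estimated size of the intermediate result fed to the next job.
        left_rows = left_rows * right.rows * selectivity
    return cost


def cheapest_plan(tables, selectivity=1e-3, job_startup_cost=1000.0):
    """Enumerate every join order and return the one with the lowest cost."""
    return min(
        permutations(tables),
        key=lambda order: plan_cost(order, selectivity, job_startup_cost),
    )


if __name__ == "__main__":
    tables = [Table("A", 1_000_000), Table("B", 1_000), Table("C", 10)]
    best = cheapest_plan(tables)
    print([t.name for t in best])
```

Exhaustive enumeration is only practical for a handful of tables; it is used here to keep the sketch short.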
  • 40. Optimizations from other research/engineering efforts  Delay Scheduling – A scheduler that takes into account both fairness and data locality  Longest Approximate Time to End (LATE) – Speculatively executes tasks based on estimated finish time – Launches speculative tasks on fast nodes  Direct I/O – Reads data from local disk where applicable, avoiding the inter-process communication costs of HDFS  Low-level optimizations – OS level: efficient data transfer with the sendfile system call – Instruction level: more efficient HDFS reads/writes via the CRC32 support in the SSE4.2 instruction extensions of Intel Nehalem processors
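The finish-time estimate behind LATE can be sketched as below. The sketch assumes each running task reports a progress score between 0 and 1 and its elapsed time, estimates the progress rate as progress divided by elapsed time, and picks the task with the longest estimated time to end. The published LATE algorithm also applies thresholds (a cap on concurrent speculative tasks, a slow-task cutoff, and a slow-node cutoff) that are omitted here. `RunningTask`, `estimated_time_left`, and `pick_speculative_task` are hypothetical names, not Hadoop APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunningTask:
    task_id: str
    progress: float   # fraction of work done, strictly between 0 and 1
    elapsed_s: float  # seconds since the task started


def estimated_time_left(progress, elapsed_s):
    """Estimate remaining seconds as (1 - progress) / progress_rate."""
    progress_rate = progress / elapsed_s
    return (1.0 - progress) / progress_rate


def pick_speculative_task(tasks):
    """Return the running task expected to finish last, or None."""
    candidates = [t for t in tasks if 0.0 < t.progress < 1.0 and t.elapsed_s > 0.0]
    if not candidates:
        return None
    return max(candidates, key=lambda t: estimated_time_left(t.progress, t.elapsed_s))


if __name__ == "__main__":
    tasks = [
        RunningTask("a", progress=0.9, elapsed_s=90.0),
        RunningTask("b", progress=0.2, elapsed_s=100.0),
        RunningTask("c", progress=0.5, elapsed_s=50.0),
    ]
    print(pick_speculative_task(tasks).task_id)
```

Ranking by estimated time to end, rather than by lowest progress, favors re-running the task that would otherwise hold the job back the longest.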
  • 41. Quick summary • Latency improvement: optimization across all layers of a MapReduce system – Query engine ▪ SQL query optimization (Tenzing, Stinger, SQL2MR) ▪ Code generation (Tenzing) – Architectural design ▪ Pipelining (HOP) – Programming model ▪ Streaming (S4) – Resource scheduling ▪ Scheduling algorithm optimization (Delay Scheduling, LATE) – Data processing and transmission ▪ In-memory processing (S4, Spark), process pool (Tenzing, Bolt MR), sort avoidance (Tenzing), more efficient system calls, etc. – Data persistence ▪ Columnar storage (Stinger), Direct I/O
  • 42. 3 Ways to Cope with Latency Lags Bandwidth  “3 Ways to Cope with Latency Lags Bandwidth”, from David Patterson – Caching ▪ Processor caches, file cache, disk cache – Replication ▪ Send multiple requests to multiple copies and use the quickest reply – Prediction ▪ Branches + prefetching  Corresponding latency-reducing approaches in MapReduce systems – Caching: in-memory cache in Spark – Replication: speculative execution in MapReduce – Prediction: pipelining in HOP
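The replication idea, sending the same request to several copies and using the quickest reply, can be illustrated with Python's standard `concurrent.futures` module. This is a sketch only: the replicas are simulated with `time.sleep`, and `fetch_from_replica` and `quickest_reply` are hypothetical helpers rather than part of any MapReduce framework.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def fetch_from_replica(name, delay_s):
    """Stand-in for a remote read; the delay simulates replica latency."""
    time.sleep(delay_s)
    return name


def quickest_reply(replicas):
    """Issue the same request to every replica and return the first reply.

    `replicas` is a list of (name, simulated_delay_seconds) pairs.
    """
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(fetch_from_replica, name, delay) for name, delay in replicas]
    done, _pending = wait(futures, return_when=FIRST_COMPLETED)
    # Do not block on the slower replicas; their results are discarded.
    pool.shutdown(wait=False)
    return next(iter(done)).result()


if __name__ == "__main__":
    print(quickest_reply([("slow", 0.5), ("fast", 0.01), ("medium", 0.3)]))
```

The price of the lower latency is redundant work on the slower replicas, the same trade-off that speculative execution makes.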
  • 43. Are We There Yet?  Identifying performance bottlenecks is an iterative process – Mitigating one bottleneck is often followed by the discovery of the next – “These 3 already fully deployed, so must find next set of tricks to cope; hard!” - David Patterson