2. Literature
• MapReduce: Simplified Data Processing on Large Clusters
J. Dean and S. Ghemawat - OSDI’04.
• Spark: Cluster Computing with Working Sets
M. Zaharia et al. - HotCloud’10.
3. Why Big Data?
• More data to process: IoT, smart devices, web applications
- About 2.3 trillion GB of new data are generated every day
• Growth of CPU performance cannot keep up with the
increasing amount of data to process
• This leads us to the Big Data era
- Big data: data sets so large that the processing power of a
single machine is inadequate to deal with them
• We need to find ways to process these massive amounts of data
4. MapReduce
• Proposed by Jeff Dean et al. (Google) in 2004
- Cited more than 18,000 times
• A programming model that enables the parallel
and distributed processing of large data sets
• Typical MapReduce Program:
- Read Data
- Map: filter and transform the data
- Shuffle and sort
- Reduce: summary operation on the data
- Write the results
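The steps above can be sketched in plain Python. This is a hypothetical single-machine illustration of the model, not the Google or Hadoop implementation; the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`) are invented for this sketch:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (key, value) pairs -- here (word, 1) for a word count
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    # Shuffle and sort: group all values under their key, ordered by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    # Reduce: summary operation (sum the counts) per key
    return {key: sum(values) for key, values in groups}

lines = ["big data", "big clusters"]                        # read data
result = reduce_phase(shuffle_and_sort(map_phase(lines)))   # run the pipeline
print(result)                                               # write the results
```

In a real cluster each phase runs in parallel on many machines and the shuffle moves intermediate data over the network, which is where the overhead discussed later comes from.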
[Figure: MapReduce data flow. The input data is split into three chunks, each processed by a Map task; the intermediate data is shuffled to Reduce tasks, which write the output data.]
5. Critical Reflection
• Outcome:
- Novel idea that led to a whole new era of distributed systems
- Big impact in industry (Hadoop MapReduce)
- Lowered the cost of computations
• Limitations:
- Restricted to batch processing
- It supports only map and reduce operations
- The shuffling phase introduces overheads
6. Spark
• Proposed by Matei Zaharia et al. in 2010
- Cited about 1,500 times
• Another programming model, based on
higher-order functions that execute
user-defined functions in parallel
• Aims to replace MapReduce in industry
• Main Ideas:
- Represent the computations as DAGs
- Cache datasets into memory
7. Spark Model
• Resilient Distributed Datasets (RDDs):
immutable collections of objects
spread across a cluster
• Operations over RDDs:
1. Transformations: lazy operators
that create new RDDs
2. Actions: launch a computation
on an RDD
[Figure: job (RDD) graph. A file split into chunks (RDD0) feeds a pipelined RDD1, then RDD2, RDD3, and RDD4, producing the result; the graph is divided into Stage 1 and Stage 2.]

var count = readFile(…)
  .map(…)
  .filter(…)
  .reduceByKey()
  .count()
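The transformation/action split and the cached, lazily built DAG can be illustrated with a toy class in plain Python. This is a hypothetical sketch of the idea, not the Spark API; `LazyDataset` and all of its methods are invented names:

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations build a plan, actions run it."""

    def __init__(self, data, plan=None):
        self._data = data           # underlying data (here: a plain iterable)
        self._plan = plan or []     # DAG of pending transformations
        self._cached = None         # set by cache() after first evaluation

    # Transformations: lazy -- they only extend the plan and return a new dataset
    def map(self, f):
        return LazyDataset(self._data, self._plan + [("map", f)])

    def filter(self, p):
        return LazyDataset(self._data, self._plan + [("filter", p)])

    # Caching: materialize once so later actions skip recomputation
    def cache(self):
        self._cached = self.collect()
        return self

    # Actions: force evaluation of the whole plan
    def collect(self):
        if self._cached is not None:
            return self._cached
        items = list(self._data)
        for op, f in self._plan:
            if op == "map":
                items = [f(x) for x in items]
            else:  # filter
                items = [x for x in items if f(x)]
        return items

    def count(self):
        return len(self.collect())

ds = LazyDataset(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
# Nothing has been computed yet; the action triggers the whole pipeline:
print(ds.count())  # 4
```

Because the plan is data, an engine like Spark can inspect the whole DAG before running it, pipeline consecutive map/filter steps in one pass, and keep cached datasets in memory for iterative jobs.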
8. Critical Reflection
• Benefits:
- High-level API
- Supports more application types
- Performance optimizations
• Limitations:
- Detailed performance analysis at the thread level is hard
- Multipurpose application support makes performance improvements and
tuning really challenging
- The shuffling phase introduces overheads
9. Conclusion
• Clusters provide the computational power to
process Big Data
• MapReduce allows developers to build programs for
clusters
• Spark tries to overcome limitations of MapReduce
• These systems introduce many challenges in terms
of measuring and improving their performance
Editor's Notes
High-level API - (in Scala, Java, Python) - usable by non computer scientists
Supports more application types - (streaming, iterative, and interactive)
Performance optimizations - (memory caching, transformation pipelining, etc.)
3* (in terms of performance, application support, and user friendliness)